Member since: 04-24-2017
Posts: 106
Kudos Received: 13
Solutions: 7
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 761 | 11-25-2019 12:49 AM |
| | 1711 | 11-14-2018 10:45 AM |
| | 1432 | 10-15-2018 03:44 PM |
| | 1318 | 09-25-2018 01:54 PM |
| | 876 | 08-03-2018 09:47 AM |
11-25-2019
12:49 AM
1 Kudo
To answer my own question: since I now use multiple partitions for the Kafka topic, Spark uses more executors to process the data, and Hive/Tez creates as many worker containers as the topic has partitions.
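For reference (not part of the original answer): a topic with more partitions can also be created programmatically with the Kafka AdminClient. This is only a sketch; the topic name and the partition count are made-up examples, and the broker list matches the one used in the question.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // broker list from the question
        props.put("bootstrap.servers", "server1:6667,server2:6667,server3:6667");

        try (AdminClient admin = AdminClient.create(props)) {
            // 8 partitions, replication factor 3 - both values are just examples
            NewTopic topic = new NewTopic("testtopic_partitioned", 8, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}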
11-24-2019
11:18 PM
I wrote a Kafka producer that sends simulated data to a Kafka topic (replication factor 3, one partition).
Now, I want to access this data by using Hive and/or Spark Streaming.
First approach: Using an external Hive table with KafkaStorageHandler:
CREATE EXTERNAL TABLE mydb.kafka_timeseriestest (
  description string,
  version int,
  ts timestamp,
  varname string,
  varvalue float
)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "testtopic",
  "kafka.bootstrap.servers" = "server1:6667,server2:6667,server3:6667"
);

-- e.g. SELECT max(varvalue) FROM mydb.kafka_timeseriestest;
-- takes too long, and only one Tez task is running
Second approach: Writing a Spark Streaming app, that accesses the Kafka topic:
// started with 10 executors, but only one executor is active
...
JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
...
In both cases, only one Tez/Spark worker is active. Therefore reading all data (~500 million entries) takes a very long time. How can I increase the performance? Is the issue caused by the one-partition topic? If yes, is there a rule of thumb according to which the number of partitions should be determined?
I'm using a HDP 3.1 cluster, running Spark, Hive and Kafka on multiple nodes:
dataNode1 - dataNode3: Hive + Spark + Kafka broker
dataNode4 - dataNode8: Hive + Spark
08-29-2019
05:28 AM
1 Kudo
I've upgraded to HDP 3.1 and now want to read an external Hive table in my Spark application. The following table shows the compatibilities: https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/integrating-hive/content/hive_configure_a_spark_hive_connection.html

I don't have LLAP activated, so it seems that I'm restricted regarding Spark -> Hive access and vice versa, right? But the compatibility table says that I can access external Hive tables from Spark without the HWC (and also without LLAP), with the hint that the table must be defined in the Spark catalog. What do I have to do here? I tried the following code, but it says "Table not found"!

SparkSession session = SparkSession.builder()
    .config("spark.executor.instances", "4")
    .master("yarn-client")
    .appName("Spark LetterCount")
    .config("hive.metastore.uris", "thrift://myhost.com:9083")
    .config("hive.metastore.warehouse.dir", "/warehouse/tablespace/managed/hive")
    .config("hive.metastore.warehouse.external.dir", "/warehouse/tablespace/external/hive")
    .config("spark.sql.warehouse.dir", new File("spark-warehouse").getAbsolutePath())
    .config("spark.sql.hive.hiveserver2.jdbc.url", "jdbc:hive2://localhost:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;user=student30")
    .enableHiveSupport()
    .getOrCreate();

Dataset<Row> dsRead = session.sql("SELECT * FROM hivedb.external_table");
System.out.println(dsRead.count());

This throws:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: `hivedb`.`external_table`; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation `hivedb`.`external_table`
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:84)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:84)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:92)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
    at main.SparkSQLExample.main(SparkSQLExample.java:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:904)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Can someone help me solve this issue? Thank you!
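For completeness, here is a sketch of one way "defined in the Spark catalog" can be read: create a table definition in Spark that points at the existing external data. This is an assumption-laden example, not a confirmed solution from the docs; the column list and the HDFS location are placeholders and must match the real external table.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RegisterExternalTable {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("RegisterExternalTable")
                .enableHiveSupport()
                .getOrCreate();

        // Create a catalog entry that points at the existing external ORC data.
        // Column list and location are placeholders, not the real schema.
        session.sql("CREATE DATABASE IF NOT EXISTS hivedb");
        session.sql("CREATE TABLE IF NOT EXISTS hivedb.external_table (id INT, name STRING) "
                + "USING ORC "
                + "LOCATION '/warehouse/tablespace/external/hive/hivedb.db/external_table'");

        Dataset<Row> dsRead = session.sql("SELECT * FROM hivedb.external_table");
        System.out.println(dsRead.count());
        session.stop();
    }
}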
02-20-2019
08:45 PM
Oh yes, you're right @Josh Elser. A closer look at the PhoenixDriver class showed that it registers a shutdown hook when the driver instance is created: https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/jdbc/PhoenixDriver.java#L73
I was able to avoid the exception by stopping the JVM immediately after the SparkContext.close() call via Runtime.getRuntime().halt(0):

... Spark logic
...
updatePhoenixTableStats(phoenixTableName); // calling the Phoenix JDBC stuff
sc.close();
Runtime.getRuntime().halt(0); // stopping the JVM immediately

Do you know whether this is a valid solution if I only use Spark and Phoenix functionality?
02-20-2019
09:56 AM
I found the solution to my problem. The jar I exported included the phoenix-spark2 and phoenix-client dependencies. I changed these dependencies (as they already exist in my cluster's HDP installation) to scope provided:

<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-spark2</artifactId>
<version>4.7.0.2.6.5.0-292</version>
<scope>provided</scope> <!-- this did it, now have to add --jar to spark-submit -->
</dependency>
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-core</artifactId>
<version>4.7.0.2.6.5.0-292</version>
<scope>provided</scope> <!-- this did it, now have to add --jar to spark-submit -->
</dependency>

Now I start my Spark job with the --jars option and reference these dependencies there, and it works fine in yarn-client mode:

spark-submit --class spark.dataimport.SparkImportApp --master yarn --deploy-mode client --jars /usr/hdp/current/phoenix-client/phoenix-spark2.jar,/usr/hdp/current/phoenix-client/phoenix-client.jar hdfs:/user/test/gk-journal-importer-phoenix-0.0.3h.jar <some parameters for the main method>

PS: In yarn-cluster mode the application worked all the time (also with the fat jar that included the dependencies).
02-19-2019
07:01 PM
Thank you for the fast answer @Josh Elser, but I'm not sure what you mean by "not a JAR that HBase or Phoenix own". The gk-journal-importer-phoenix-0.0.3h.jar is my exported Java project. It contains the classes of my import job and the main class whose main method creates, uses, and closes the SparkContext. I start my Spark application by running the command:

spark-submit --class spark.dataimport.SparkImportApp --master yarn --deploy-mode client hdfs:/user/test/gk-journal-importer-phoenix-0.0.3h.jar <some parameters for the main method>
02-19-2019
01:48 PM
I wrote a Spark application for bulk-loading a Phoenix table. Everything worked for a few weeks, but for a few days I have been getting problems with duplicated rows, caused by faulty table stats. A possible workaround is to delete and re-generate the stats for this table, so I need to open a JDBC connection to my Phoenix database and run the statements for deleting and re-creating the stats. Since I need to do this after inserting the new data via Spark, I want to create and use this JDBC connection inside my Spark job, after the table bulk-loading is done. For that I added the following method and call it between the dataframe.save() and sparkContext.close() calls in my Java code:

private static void updatePhoenixTableStatistics(String phoenixTableName) {
try {
Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
System.out.println("Connecting to database..");
Connection conn = DriverManager.getConnection("jdbc:phoenix:my-sever.net:2181:/hbase-unsecure");
System.out.println("Creating statement...");
Statement st = conn.createStatement();
st.executeUpdate("DELETE FROM SYSTEM.STATS WHERE physical_name='" + phoenixTableName + "'");
System.out.println("Successfully deleted statistics data... Now refreshing it.");
st.executeUpdate("UPDATE STATISTICS " + phoenixTableName + " ALL");
System.out.println("Successfully refreshed statistics data.");
st.close();
conn.close();
System.out.println("Connection closed.");
} catch (Exception e) {
System.out.println("Unable to update table statistics - Skipping this step!");
e.printStackTrace();
}
}

The problem is that since I added this method, I always get the following exception at the end of my Spark job:

Bulk-Load: DataFrame.save() completed - Import finished successfully!
Updating Table Statistics:
Connecting to database..
Creating statement...
Successfully deleted statistics data... Now refreshing it.
Successfully refreshed statistics data.
Connection closed.
Exception in thread "Thread-31" java.lang.RuntimeException: java.io.FileNotFoundException: /tmp/spark-e5b01508-0f84-4702-9684-4f6ceac803f9/gk-journal-importer-phoenix-0.0.3h.jar (No such file or directory)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2794)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2646)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2518)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:1065)
at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1119)
at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1520)
at org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:68)
at org.apache.hadoop.hbase.HBaseConfiguration.addHbaseResources(HBaseConfiguration.java:82)
at org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:97)
at org.apache.phoenix.query.ConfigurationFactory$ConfigurationFactoryImpl$1.call(ConfigurationFactory.java:49)
at org.apache.phoenix.query.ConfigurationFactory$ConfigurationFactoryImpl$1.call(ConfigurationFactory.java:46)
at org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:78)
at org.apache.phoenix.util.PhoenixContextExecutor.callWithoutPropagation(PhoenixContextExecutor.java:93)
at org.apache.phoenix.query.ConfigurationFactory$ConfigurationFactoryImpl.getConfiguration(ConfigurationFactory.java:46)
at org.apache.phoenix.jdbc.PhoenixDriver$1.run(PhoenixDriver.java:88)
Caused by: java.io.FileNotFoundException: /tmp/spark-e5b01508-0f84-4702-9684-4f6ceac803f9/gk-journal-importer-phoenix-0.0.3h.jar (No such file or directory)
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:225)
at java.util.zip.ZipFile.<init>(ZipFile.java:155)
at java.util.jar.JarFile.<init>(JarFile.java:166)
at java.util.jar.JarFile.<init>(JarFile.java:103)
at sun.net.www.protocol.jar.URLJarFile.<init>(URLJarFile.java:93)
at sun.net.www.protocol.jar.URLJarFile.getJarFile(URLJarFile.java:69)
at sun.net.www.protocol.jar.JarFileFactory.get(JarFileFactory.java:99)
at sun.net.www.protocol.jar.JarURLConnection.connect(JarURLConnection.java:122)
at sun.net.www.protocol.jar.JarURLConnection.getInputStream(JarURLConnection.java:152)
at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2612)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2693)
... 14 more

Does someone know about this problem and can help? How do you generally work with JDBC inside a Spark job? Or is there another way to do this? I'm working on HDP 2.6.5 with Spark 2.3 and Phoenix 4.7 installed.
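As a side note (not an answer to the exception above): the JDBC part itself can be written with try-with-resources so the connection and statement are always closed, even when something fails. A minimal sketch of that variant, using the same URL and statements as in the method above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PhoenixStatsUpdater {

    // Same logic as the method in the question, but with try-with-resources
    // so Connection and Statement are closed even when an exception occurs.
    static void updatePhoenixTableStatistics(String phoenixTableName) {
        String url = "jdbc:phoenix:my-sever.net:2181:/hbase-unsecure"; // URL from the question
        try (Connection conn = DriverManager.getConnection(url);
             Statement st = conn.createStatement()) {
            // delete the stale statistics rows and let Phoenix rebuild them
            st.executeUpdate("DELETE FROM SYSTEM.STATS WHERE physical_name='" + phoenixTableName + "'");
            st.executeUpdate("UPDATE STATISTICS " + phoenixTableName + " ALL");
        } catch (Exception e) {
            System.out.println("Unable to update table statistics - skipping this step!");
            e.printStackTrace();
        }
    }
}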
12-19-2018
10:25 AM
I wrote different Spark applications that save Dataset data to Phoenix tables. When I try to process huge datasets, some of my Spark jobs fail with an ExecutorLostFailure exception. The jobs are retried and seem to finish successfully on their second attempts. Here is the code that saves the dataframe to my Phoenix table:

dfToSave.write()
    .format("org.apache.phoenix.spark")
    .mode("overwrite")
    .option("table", "PHOENIX_TABLE_NAME")
    .option("zkUrl", "server.name:2181:/hbase-unsecure")
    .save();

Here is the output of one of the jobs in the Spark History UI:

ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 3.0 GB of 3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

Why do I get this error when I use the Spark plugin for Apache Phoenix? Are there any configurations to manage the memory consumption of the Phoenix-Spark job?
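For illustration only, here is a sketch of where the property mentioned in the YARN message can be set programmatically. The 1024 MB value is just an example, not a recommendation, and the same property can also be passed on the command line via spark-submit --conf spark.yarn.executor.memoryOverhead=1024.

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class MemoryOverheadExample {
    public static void main(String[] args) {
        // Give each executor extra non-heap headroom so YARN does not kill the
        // container; 1024 MB is only an illustrative value.
        SparkConf conf = new SparkConf()
                .setAppName("PhoenixBulkLoad")
                .set("spark.yarn.executor.memoryOverhead", "1024");

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        // ... dataframe creation and the Phoenix save() call from the question ...
        spark.stop();
    }
}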
11-14-2018
10:45 AM
I found the following Java-based solution: using the Dataset.filter method with a FilterFunction: https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql/Dataset.html

My code now looks like this:

Dataset<Row> dsResult = sqlC.read()
.format("org.apache.phoenix.spark")
.option("table", tableName)
.option("zkUrl", hbaseUrl).load()
.where("OTHER_COLUMN = " + inputId)
.filter(row -> {
long readTime = row.getTimestamp(row.fieldIndex("TABLE_TS_COL")).getTime();
long tsFrom = new Timestamp(sdf.parse(dateFrom).getTime()).getTime();
long tsTo = new Timestamp(sdf.parse(dateTo).getTime()).getTime();
return readTime >= tsFrom && readTime <= tsTo;
});
11-14-2018
08:10 AM
I have a Phoenix table that I can access via Spark SQL (with the Phoenix Spark plugin). The table also has a timestamp column, which I have to filter by a user input like 2018-11-14 01:02:03. So I want to filter my Dataset (which represents the read Phoenix table) with the where / filter methods. My current Java code looks like this:

Timestamp t1 = new Timestamp(sdf.parse(dateFrom).getTime());
Timestamp t2 = new Timestamp(sdf.parse(dateTo).getTime());
Column c1 = new Column("TABLE_TS_COL").geq(t1);
Column c2 = new Column("TABLE_TS_COL").leq(t2);
Dataset<Row> dsResult = sqlContext.read()
.format("org.apache.phoenix.spark")
.option("table", tableName)
.option("zkUrl", hbaseUrl).load()
.where("OTHER_COLUMN = " + inputId) // This works
.where(c1) // Problem!
.where(c2) // Problem!
But this leads to the following exception:

java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.phoenix.exception.PhoenixParserException: ERROR 604 (42P00): Syntax error. Mismatched input. Expecting "RPAREN", got "06" at line 1, column 474.

My Spark History UI shows the following select statement:

...
18/11/14 08:54:58 INFO PhoenixInputFormat: Select Statement: SELECT "OTHER_COLUMN", "TABLE_TS_COL" FROM HBASE_TEST3 WHERE ( "OTHER_COLUMN" = 0 AND "OTHER_COLUMN" IS NOT NULL AND "TABLE_TS_COL" IS NOT NULL AND "TABLE_TS_COL" >= 2018-09-24 06:49:01.0 AND "TABLE_TS_COL" <= 2018-09-24 06:49:01.0)
To me it looks like the quotation marks around the timestamp values are missing (not sure about that). How can I filter a timestamp column by a user input in Java and Spark SQL?
Labels: Apache Phoenix, Apache Spark
10-15-2018
03:44 PM
Solved it - Phoenix arrays are 1-based, so the following query works:

SELECT REGEXP_SPLIT(ROWKEY, ':')[1] as test, count(1) FROM "my_view" GROUP BY REGEXP_SPLIT(ROWKEY, ':')[1]
10-15-2018
03:40 PM
I have a Phoenix view with a row key column ROWKEY that has a layout like this: <hash>:<attributeA>_<attributeB>

I want to count the rows per <hash> value, so I need to group my view by the <hash> value, which I get by splitting the row key column. I tried to use the REGEXP_SPLIT function of Phoenix, but I get an exception:

%jdbc(phoenix)
SELECT REGEXP_SPLIT(ROWKEY, ':')[0] as test, count(1) FROM "my_view" GROUP BY REGEXP_SPLIT(ROWKEY, ':')[0]
The exception:

org.apache.phoenix.exception.PhoenixIOException: org.apache.phoenix.exception.PhoenixIOException: org.apache.hadoop.hbase.DoNotRetryIOException: my_view,000,1539582312877.93d8d6e785eae60fedac3c6088b4e556.: 32767
at org.apache.phoenix.util.ServerUtil.createIOException(ServerUtil.java:93)
at org.apache.phoenix.util.ServerUtil.throwIOException(ServerUtil.java:59)
at org.apache.phoenix.coprocessor.BaseScannerRegionObserver.postScannerOpen(BaseScannerRegionObserver.java:271)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$52.call(RegionCoprocessorHost.java:1301)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1660)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1734)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperationWithResult(RegionCoprocessorHost.java:1699)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postScannerOpen(RegionCoprocessorHost.java:1296)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2404)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32385)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 32767
at org.apache.phoenix.schema.types.PArrayDataType.positionAtArrayElement(PArrayDataType.java:418)
at org.apache.phoenix.schema.types.PArrayDataType.positionAtArrayElement(PArrayDataType.java:379)
at org.apache.phoenix.expression.function.ArrayIndexFunction.evaluate(ArrayIndexFunction.java:64)
at org.apache.phoenix.util.TupleUtil.getConcatenatedValue(TupleUtil.java:101)
at org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver.scanUnordered(GroupedAggregateRegionObserver.java:418)
at org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver.doPostScannerOpen(GroupedAggregateRegionObserver.java:162)
at org.apache.phoenix.coprocessor.BaseScannerRegionObserver.postScannerOpen(BaseScannerRegionObserver.java:237)
... 11 more
at org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:117)
at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:780)
at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:721)
at org.apache.phoenix.iterate.MergeSortResultIterator.getMinHeap(MergeSortResultIterator.java:72)
at org.apache.phoenix.iterate.MergeSortResultIterator.minIterator(MergeSortResultIterator.java:93)
at org.apache.phoenix.iterate.MergeSortResultIterator.next(MergeSortResultIterator.java:58)
at org.apache.phoenix.iterate.BaseGroupedAggregatingResultIterator.next(BaseGroupedAggregatingResultIterator.java:64)
at org.apache.phoenix.iterate.DelegateResultIterator.next(DelegateResultIterator.java:44)
at org.apache.phoenix.iterate.LimitingResultIterator.next(LimitingResultIterator.java:47)
at org.apache.phoenix.jdbc.PhoenixResultSet.next(PhoenixResultSet.java:778)
at org.apache.commons.dbcp2.DelegatingResultSet.next(DelegatingResultSet.java:191)
at org.apache.commons.dbcp2.DelegatingResultSet.next(DelegatingResultSet.java:191)
at org.apache.zeppelin.jdbc.JDBCInterpreter.getResults(JDBCInterpreter.java:510)
at org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:694)
at org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:763)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:101)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:502)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: org.apache.phoenix.exception.PhoenixIOException: org.apache.hadoop.hbase.DoNotRetryIOException: my_view,000,1539582312877.93d8d6e785eae60fedac3c6088b4e556.: 32767
at org.apache.phoenix.util.ServerUtil.createIOException(ServerUtil.java:93)
at org.apache.phoenix.util.ServerUtil.throwIOException(ServerUtil.java:59)
at org.apache.phoenix.coprocessor.BaseScannerRegionObserver.postScannerOpen(BaseScannerRegionObserver.java:271)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$52.call(RegionCoprocessorHost.java:1301)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1660)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1734)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperationWithResult(RegionCoprocessorHost.java:1699)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postScannerOpen(RegionCoprocessorHost.java:1296)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2404)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32385)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 32767
at org.apache.phoenix.schema.types.PArrayDataType.positionAtArrayElement(PArrayDataType.java:418)
at org.apache.phoenix.schema.types.PArrayDataType.positionAtArrayElement(PArrayDataType.java:379)
at org.apache.phoenix.expression.function.ArrayIndexFunction.evaluate(ArrayIndexFunction.java:64)
at org.apache.phoenix.util.TupleUtil.getConcatenatedValue(TupleUtil.java:101)
at org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver.scanUnordered(GroupedAggregateRegionObserver.java:418)
at org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver.doPostScannerOpen(GroupedAggregateRegionObserver.java:162)
at org.apache.phoenix.coprocessor.BaseScannerRegionObserver.postScannerOpen(BaseScannerRegionObserver.java:237)
... 11 more
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:206)
at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:775)
... 24 more
Caused by: org.apache.phoenix.exception.PhoenixIOException: org.apache.hadoop.hbase.DoNotRetryIOException: my_view,000,1539582312877.93d8d6e785eae60fedac3c6088b4e556.: 32767
at org.apache.phoenix.util.ServerUtil.createIOException(ServerUtil.java:93)
at org.apache.phoenix.util.ServerUtil.throwIOException(ServerUtil.java:59)
at org.apache.phoenix.coprocessor.BaseScannerRegionObserver.postScannerOpen(BaseScannerRegionObserver.java:271)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$52.call(RegionCoprocessorHost.java:1301)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1660)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1734)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperationWithResult(RegionCoprocessorHost.java:1699)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postScannerOpen(RegionCoprocessorHost.java:1296)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2404)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32385)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 32767
at org.apache.phoenix.schema.types.PArrayDataType.positionAtArrayElement(PArrayDataType.java:418)
at org.apache.phoenix.schema.types.PArrayDataType.positionAtArrayElement(PArrayDataType.java:379)
at org.apache.phoenix.expression.function.ArrayIndexFunction.evaluate(ArrayIndexFunction.java:64)
at org.apache.phoenix.util.TupleUtil.getConcatenatedValue(TupleUtil.java:101)
at org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver.scanUnordered(GroupedAggregateRegionObserver.java:418)
at org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver.doPostScannerOpen(GroupedAggregateRegionObserver.java:162)
at org.apache.phoenix.coprocessor.BaseScannerRegionObserver.postScannerOpen(BaseScannerRegionObserver.java:237)
... 11 more
at org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:117)
at org.apache.phoenix.iterate.TableResultIterator.initScanner(TableResultIterator.java:252)
at org.apache.phoenix.iterate.ParallelIterators$1.call(ParallelIterators.java:113)
at org.apache.phoenix.iterate.ParallelIterators$1.call(ParallelIterators.java:108)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask.run(JobManager.java:183)
... 3 more
Caused by: org.apache.hadoop.hbase.DoNotRetryIOException: org.apache.hadoop.hbase.DoNotRetryIOException: my_view,000,1539582312877.93d8d6e785eae60fedac3c6088b4e556.: 32767
at org.apache.phoenix.util.ServerUtil.createIOException(ServerUtil.java:93)
at org.apache.phoenix.util.ServerUtil.throwIOException(ServerUtil.java:59)
at org.apache.phoenix.coprocessor.BaseScannerRegionObserver.postScannerOpen(BaseScannerRegionObserver.java:271)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$52.call(RegionCoprocessorHost.java:1301)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1660)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1734)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperationWithResult(RegionCoprocessorHost.java:1699)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postScannerOpen(RegionCoprocessorHost.java:1296)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2404)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32385)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 32767
at org.apache.phoenix.schema.types.PArrayDataType.positionAtArrayElement(PArrayDataType.java:418)
at org.apache.phoenix.schema.types.PArrayDataType.positionAtArrayElement(PArrayDataType.java:379)
at org.apache.phoenix.expression.function.ArrayIndexFunction.evaluate(ArrayIndexFunction.java:64)
at org.apache.phoenix.util.TupleUtil.getConcatenatedValue(TupleUtil.java:101)
at org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver.scanUnordered(GroupedAggregateRegionObserver.java:418)
at org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver.doPostScannerOpen(GroupedAggregateRegionObserver.java:162)
at org.apache.phoenix.coprocessor.BaseScannerRegionObserver.postScannerOpen(BaseScannerRegionObserver.java:237)
... 11 more
at sun.reflect.GeneratedConstructorAccessor38.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:335)
at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:391)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:208)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:63)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:211)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:396)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:370)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
... 3 more
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.DoNotRetryIOException): org.apache.hadoop.hbase.DoNotRetryIOException: my_view,000,1539582312877.93d8d6e785eae60fedac3c6088b4e556.: 32767
at org.apache.phoenix.util.ServerUtil.createIOException(ServerUtil.java:93)
at org.apache.phoenix.util.ServerUtil.throwIOException(ServerUtil.java:59)
at org.apache.phoenix.coprocessor.BaseScannerRegionObserver.postScannerOpen(BaseScannerRegionObserver.java:271)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$52.call(RegionCoprocessorHost.java:1301)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1660)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1734)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperationWithResult(RegionCoprocessorHost.java:1699)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postScannerOpen(RegionCoprocessorHost.java:1296)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2404)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32385)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 32767
at org.apache.phoenix.schema.types.PArrayDataType.positionAtArrayElement(PArrayDataType.java:418)
at org.apache.phoenix.schema.types.PArrayDataType.positionAtArrayElement(PArrayDataType.java:379)
at org.apache.phoenix.expression.function.ArrayIndexFunction.evaluate(ArrayIndexFunction.java:64)
at org.apache.phoenix.util.TupleUtil.getConcatenatedValue(TupleUtil.java:101)
at org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver.scanUnordered(GroupedAggregateRegionObserver.java:418)
at org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver.doPostScannerOpen(GroupedAggregateRegionObserver.java:162)
at org.apache.phoenix.coprocessor.BaseScannerRegionObserver.postScannerOpen(BaseScannerRegionObserver.java:237)
... 11 more
at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1227)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:218)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:292)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32831)
at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:383)
... 10 more
What is the correct query?
Labels: Apache Phoenix
10-15-2018
02:47 PM
I want to access an HBase table via Phoenix in a Zeppelin JDBC paragraph:

%jdbc(phoenix)
SELECT COUNT(*) FROM myphoenixview

For a smaller view this works perfectly, but when I run the query on a big view, I get a timeout exception:

java.sql.SQLTimeoutException: Operation timed out.
at org.apache.phoenix.exception.SQLExceptionCode$14.newException(SQLExceptionCode.java:364)
at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:831)
at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:721)
at org.apache.phoenix.iterate.ConcatResultIterator.getIterators(ConcatResultIterator.java:50)
at org.apache.phoenix.iterate.ConcatResultIterator.currentIterator(ConcatResultIterator.java:97)
at org.apache.phoenix.iterate.ConcatResultIterator.next(ConcatResultIterator.java:117)
at org.apache.phoenix.iterate.BaseGroupedAggregatingResultIterator.next(BaseGroupedAggregatingResultIterator.java:64)
at org.apache.phoenix.iterate.UngroupedAggregatingResultIterator.next(UngroupedAggregatingResultIterator.java:39)
at org.apache.phoenix.iterate.DelegateResultIterator.next(DelegateResultIterator.java:44)
at org.apache.phoenix.iterate.LimitingResultIterator.next(LimitingResultIterator.java:47)
at org.apache.phoenix.jdbc.PhoenixResultSet.next(PhoenixResultSet.java:778)
at org.apache.commons.dbcp2.DelegatingResultSet.next(DelegatingResultSet.java:191)
at org.apache.commons.dbcp2.DelegatingResultSet.next(DelegatingResultSet.java:191)
at org.apache.zeppelin.jdbc.JDBCInterpreter.getResults(JDBCInterpreter.java:510)
at org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:694)
at org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:763)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:101)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:502)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

I set the following properties in my Ambari HBase service config (I'm using HDP 2.6.5 with HBase 1.1 and Phoenix 4.7):

hbase.rpc.timeout = 180000
phoenix.query.timeoutMs = 180000
zookeeper.session.timeout = 180000

I also set the same properties for my JDBC and HBase interpreters in Zeppelin, but the query still times out after 60 seconds. How can I avoid this? Where do these properties have to be set?
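For illustration only: when Phoenix is used through plain JDBC, client-side overrides such as phoenix.query.timeoutMs can also be passed as connection properties. Whether the Zeppelin JDBC interpreter forwards its settings the same way is a separate question; the sketch below assumes a standalone Java client and a placeholder ZooKeeper quorum.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class PhoenixTimeoutExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // client-side overrides; values in milliseconds
        props.setProperty("phoenix.query.timeoutMs", "180000");
        props.setProperty("hbase.rpc.timeout", "180000");

        // placeholder ZooKeeper quorum
        String url = "jdbc:phoenix:my-zk-host:2181:/hbase-unsecure";
        try (Connection conn = DriverManager.getConnection(url, props);
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM myphoenixview")) {
            if (rs.next()) {
                System.out.println("row count: " + rs.getLong(1));
            }
        }
    }
}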
10-10-2018
06:16 AM
I have a huge table table1 in Hive, which contains around 60 million rows (~500 GB of ORC files in HDFS). It is partitioned by the column partCol. Now I want to create a new table table2 in Hive that has the same schema and shall contain only 50 million rows of table1. Therefore I run this query:

set hive.exec.dynamic.partition=true;
INSERT OVERWRITE TABLE testdb.table2 partition(partCol) SELECT colA, colB, ..., partCol FROM testdb.table1 LIMIT 50000000;

This creates a lot of Tez mapper tasks, which looks and works fine - the tasks take around 1 h to finish. And now the problem: afterwards there is only 1 reducer task, which runs for hours and then fails! How can I increase the number of reducer tasks for this query? Is the LIMIT clause the issue?

System information: I'm using the Hortonworks Data Platform 2.6.5 with Hive 1.2.1. The following Hive settings are configured:

hive.execution.engine = TEZ
hive.tez.auto.reducer.parallelism = true
hive.exec.reducers.bytes.per.reducer = 64 MB
hive.exec.reducers.max = 1009

Tez settings:

tez.grouping.min-size = 16 MB
tez.grouping.max-size = 1 GB
Labels: Apache Hive, Apache Tez
09-25-2018
01:54 PM
1 Kudo
The problem was solved after changing the MySQL database URL from

jdbc:mysql://xxxx.yyyy/hive?createDatabaseIfNotExist=true

to

jdbc:mysql://xxxx.yyyy/hive?createDatabaseIfNotExist=true&serverTimezone=Europe/Berlin

I found the relevant information here: https://community.hortonworks.com/questions/218023/error-setting-up-hive-on-hdp-265timezone-on-mysql.html
09-25-2018
01:27 PM
I'm just setting up a Hortonworks Data Platform 3.0 installation. When I start the services for the first time, the Hive Metastore start fails with an exception:

Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
Underlying cause: java.sql.SQLException : The server time zone value 'CEST' is unrecognized or represents more than one time zone. You must configure either the server or JDBC driver (via the serverTimezone configuration property) to use a more specifc time zone value if you want to utilize time zone support.
SQL Error code: 0
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
at org.apache.hadoop.hive.metastore.tools.HiveSchemaHelper.getConnectionToMetastore(HiveSchemaHelper.java:94)
at org.apache.hive.beeline.HiveSchemaTool.getConnectionToMetastore(HiveSchemaTool.java:169)
at org.apache.hive.beeline.HiveSchemaTool.testConnectionToMetastore(HiveSchemaTool.java:475)
at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:581)
at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:567)
at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:1539)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
at org.apache.hadoop.util.RunJar.main(RunJar.java:232)
Caused by: java.sql.SQLException: The server time zone value 'CEST' is unrecognized or represents more than one time zone. You must configure either the server or JDBC driver (via the serverTimezone configuration property) to use a more specifc time zone value if you want to utilize time zone support.
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:89)
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:63)
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:73)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:76)
at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:832)
at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:456)
at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:240)
at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:207)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
at org.apache.hadoop.hive.metastore.tools.HiveSchemaHelper.getConnectionToMetastore(HiveSchemaHelper.java:88)
... 11 more

On my CentOS system I set the timezone to Europe/Berlin:

ls -l /etc/localtime
lrwxrwxrwx. 1 root root 35 Sep 25 10:45 /etc/localtime -> ../usr/share/zoneinfo/Europe/Berlin
timedatectl | grep -i 'time zone'
Time zone: Europe/Berlin (CEST, +0200)
Does anyone know how to solve this problem? Thank you!
09-14-2018
10:27 AM
@Felix Albani Thank you for your help! Without the LIMIT clause, the job works perfectly (and in parallel).
09-12-2018
09:40 AM
I'm using the Spark HBase Connector (SHC, which is included in my HDP 2.6.5 installation) to move data into HBase via Spark SQL. Since my dataframe (the cell values) is very large, I sometimes get a "KeyValue size too large" exception. I found this article, which says that I have to increase the value of the HBase property "hbase.client.keyvalue.maxsize": https://community.hortonworks.com/content/supportkb/154006/errorjavalangillegalargumentexception-keyvalue-siz-1.html I changed this property in my HBase configuration via Ambari, but that does not seem to help. Now I am trying to set this property in my application (a Zeppelin notebook), so I need to pass it to SHC, but I don't know how (setting the property in my Zeppelin HBase interpreter settings also didn't help). Can someone show me how to change this property for SHC? Is there a way to do this? Thank you!
09-11-2018
02:06 PM
Here is the extended explain of the DataFrame; LLAP is not enabled in our cluster:

df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [importtime: timestamp, tester: string ... 21 more fields]
== Parsed Logical Plan ==
GlobalLimit 1
+- LocalLimit 1
+- Project [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22]
+- SubqueryAlias mytable
+- Relation[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] orc
== Analyzed Logical Plan ==
importtime: timestamp, tester: string, testerhead: string, matchcode: string, revision: string, usertext: string, teststufe: string, temp: string, filedat: date, starttime: timestamp, endtime: timestamp, lotnrc: string, wafernr: string, testname: string, testnumber: string, testtype: string, unit: string, highl: string, lowl: string, highs: string, lows: string, valuelist: string, hashvalue: int
GlobalLimit 1
+- LocalLimit 1
+- Project [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22]
+- SubqueryAlias mytable
+- Relation[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] orc
== Optimized Logical Plan ==
InMemoryRelation [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- CollectLimit 1
+- *FileScan orc mydb.mytable[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] Batched: false, Format: ORC, Location: CatalogFileIndex[hdfs://hdp-m-01:8020/apps/hive/warehouse/mydb.db/mytab..., PartitionCount: 1000, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<importtime:timestamp,tester:string,testerhead:string,matchcode:string,revision:string,user...
== Physical Plan ==
InMemoryTableScan [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22]
+- InMemoryRelation [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- CollectLimit 1
+- *FileScan orc mydb.mytable[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] Batched: false, Format: ORC, Location: CatalogFileIndex[hdfs://hdp-m-01:8020/apps/hive/warehouse/mydb.db/mytab..., PartitionCount: 1000, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<importtime:timestamp,tester:string,testerhead:string,matchcode:string,revision:string,user...
09-11-2018
12:01 PM
I have a huge Hive table (ORC) and I want to select just a few rows of the table (in Zeppelin):

%spark
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
val df1 = sqlContext.sql("SELECT * FROM mydb.myhugetable LIMIT 1")   // Takes 10 mins
val df2 = sqlContext.sql("SELECT * FROM mydb.myhugetable").limit(1)  // Takes 10 mins

Using the LIMIT clause in my SQL statement or the corresponding dataframe method DF.limit doesn't help, as the query still takes too long. It seems to read the whole table first and only then return the n rows. How can I make the limit take effect while running the SQL, so that the query runs faster? Shouldn't the filter pushdown help here? I can't see any difference between spark.sql.orc.filterPushdown set to true or false.
Thank you!
09-03-2018
07:15 AM
Thank you! Can you give me some details about this or do you have some helpful links?
09-01-2018
06:17 PM
1 Kudo
I have a huge Hive table, which works fine so far. Now I want to play around with HBase, so I'm looking for a way to get my Hive table data into a (new) HBase table. I already found some solutions for that, but I'm not sure which way is the best one. By the way, I'm familiar with Spark, so working with RDDs / Datasets is not a problem. I'm using the Hortonworks Data Platform 2.6.5.

1. SHC (Spark HBase Connector)
- reading the Hive data into a Dataset via Spark SQL
- creating the HBase table via the HBase shell
- defining a Catalog object that maps the Hive columns to HBase column families and qualifiers
- writing the data of the Dataset via df.write.options(...).format("org.apache.spark.sql.execution.datasources.hbase").save()

2. Phoenix
- creating the Phoenix table via JDBC
- reading the Hive table data into a Dataset via Spark SQL
- writing the Dataset via df.write.options(...).format("org.apache.phoenix.spark").save()

3. Hive-HBase integration (https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration)
- create the HBase table
- create an external Hive table (as template for the HFile creation)
- set the Hive properties for HFile creation: set hfile.family.path=/tmp/my_test_table/cf and set hive.hbase.generatehfiles=true
- move the data from Hive to HBase with an INSERT OVERWRITE TABLE ... SELECT FROM ... statement
- load the generated HFiles into HBase with the hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles tool

4. Spark native (https://www.opencore.com/de/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/)
- reading the Hive table data into a Dataset via Spark SQL
- transforming the Dataset into a PairRDD<ImmutableBytesWritable, KeyValue>
- saving this RDD as HFiles by calling rdd.saveAsNewAPIHadoopFile
- importing the HFiles into HBase with the hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles tool

Are there other interesting ways to bulk load HBase from Hive data? Which of the ways above is the most "common" one? Thank you for your help!
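For illustration, a minimal Java sketch of what option 2 (Phoenix) could look like end-to-end, using the same phoenix-spark write call mentioned above. The database, table, and ZooKeeper names are placeholders, and the Phoenix target table is assumed to exist already.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HiveToPhoenixBulkLoad {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("HiveToPhoenix")
                .enableHiveSupport()
                .getOrCreate();

        // 1. Read the Hive table (placeholder names) into a Dataset via Spark SQL.
        Dataset<Row> df = spark.sql("SELECT * FROM mydb.my_hive_table");

        // 2. Write the Dataset into an existing Phoenix table via the phoenix-spark plugin.
        df.write()
          .format("org.apache.phoenix.spark")
          .mode("overwrite")
          .option("table", "MY_PHOENIX_TABLE")
          .option("zkUrl", "zk-host:2181:/hbase-unsecure")
          .save();

        spark.stop();
    }
}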
Labels: Apache HBase, Apache Hive, Apache Phoenix
08-15-2018
07:35 AM
I have a huge Hive table that I want to process in Spark. To avoid a full table scan for my test purposes, I first run this code:

%spark
val df1 = sqlContext.sql("select * from mydb.mytable limit 10")
df1.show()
That runs very fast (1 sec). Then I added another query to "transform" the result a little bit - now Hive seems to do a full table scan, as this execution takes more than 10 minutes!

%spark
val df2 = sqlContext.sql("SELECT concat_ws(\":\", lpad(cast((hashvalue % 200) as String), 3, \"0\"), concat_ws(\"_\", name, number, test, step, cast(starttime as STRING))) as key, hashvalue, valuelist from (select * from mydb.mytable limit 10) a")
df2.show()
1. Is this normal behavior? 2. Is there a way to avoid full table scans when using sub-queries? Thank you!
Labels: Apache Hive, Apache Spark, Apache Zeppelin
08-08-2018
08:40 AM
1 Kudo
I created a Hive table that can access an HBase table. Something like this:

create external table hbase_50mio_200regions (rowkey String, valuelist String)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:valueList")
TBLPROPERTIES ('hbase.table.name' = 'hbase_50mio_200regions')
The access works, and HiveQL queries are fast for small HBase tables. The problem is queries against huge HBase tables: even a simple count takes very long (>1 h on a 50 million row HBase table with about 1300 GB of HFiles)! Is there a way to pass a start and stop key to my Hive SELECT or COUNT queries? According to the following answer, https://stackoverflow.com/a/40729818, WHERE conditions won't avoid a full table scan, as they only filter the rows after iterating over all rows of the HBase table. Is this right, and/or is there another way to avoid full table scans when accessing my HBase table through Hive?
Labels: Apache HBase, Apache Hive
08-07-2018
02:14 PM
Is there a possibility to query a salted HBase table using the HBase shell? E.g., my HBase table row keys look like this:

000:rowkey0
000:rowkey2
...
001:rowkey1
001:rowkey3
...
How can I use the HBase shell to query this table without doing a full table scan, e.g. when I want to read the row with rowkey1? Thank you!
Labels: Apache HBase
08-06-2018
02:14 PM
Yes, I'm already using Spark 2.2 (changed the configs of my %spark Interpreter)
08-06-2018
12:40 PM
@Josh Elser Thank you for that information. I changed my HBase table creation to the following command:

create 'hbase_1m_10r', {NAME => 'cf'}, {SPLITS => ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']}

When running the following query:

INSERT OVERWRITE TABLE dmueller.hbase_1m_10r SELECT concat_ws(":", cast((hashvalue % 10) as String), concat_ws("_", name, number, test, step, cast(starttime as STRING))) as k, valuelist from (select * from testdb.test_orc limit 1000000) a distribute by split(k, ":")[0] sort by k

I still have only 1 reducer... Any idea why?
08-06-2018
07:28 AM
I set up an HDP 2.6 cluster, where I created a huge Hive table (50 million rows, about 500 GB of ORC files). Now I'm playing around in Zeppelin and querying the data, and I discovered the following behavior: when I run a query with a LIMIT clause and print the resulting dataframe, it takes ~3 seconds to finish. The Spark History UI shows 1 task.

%spark2
sqlContext.clearCache()
val df = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10")
df.show()

Afterwards, I added a second dataframe, just to play around with new columns, nothing specific (see the query below). Now the query takes minutes and the Spark History UI shows ~4000 tasks (still with the LIMIT clause).

%spark2
sqlContext.clearCache()
val df = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10")
// With these rows it takes minutes!
val df2 = df.withColumn("valuelist_array", df.col("valuelist"))
df2.show()
I see the same behavior when I just call the cache method before showing the dataframe: it takes minutes and about 4000 tasks (with the LIMIT clause):

%spark2
// Takes also minutes now, when I add the cache call!
sqlContext.clearCache()
val df = sqlContext.sql("SELECT valuelist FROM mydb.mytable limit 10").cache()
df.show()
Why does Spark use ~4000 tasks (it seems to do a full table scan) when I re-use the DataFrame or want to cache it? Thanks for your helpful explanations!
Labels: Apache Spark
08-03-2018
10:03 AM
I have a Hive ORC table with the following schema:

CREATE TABLE mydb.orc_table (tester String, counter int, hashvalue int, valuelist String) STORED AS ORC;
Now I found the following tutorial about how to generate (and import) HFiles from a Hive table to transfer the data into HBase: https://community.hortonworks.com/articles/2745/creating-hbase-hfiles-from-an-existing-hive-table.html In this article they just create a new Hive table that is linked with an HBase table. That also works fine for me:

create table mydb.hbase_table(rowkey String, hashvalue int, valuelist String)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:hashValue,cf:valueList");
In the next step I need to create the HFiles by running the following query:

set hfile.family.path=/tmp/stdf_hbase/cf;
set hive.hbase.generatehfiles=true;
INSERT OVERWRITE TABLE mydb.hbase_table SELECT concat_ws("_", tester, counter) as key, hashvalue, valuelist from (select * from mydb.orc_table limit 1000000) a ORDER BY key, hashvalue;
In the example, that's all there is to do. But when I run this query, the Tez task runs into the known "Added a key not lexically larger than previous [...]" error:

Added a key not lexically larger than previous. Current cell = tester1_1/cf:hashValue/1533284948384/Put/vlen=3/seqid=0, lastCell = tester1_1/cf:valueList/1533284948384/Put/vlen=15231/seqid=0

I know that this exception occurs because Tez tried to write the value of column "hashValue" after writing the "valueList" column value. As the data for an HFile must be ordered by <Key>/<CF>:<CQ>, I somehow have to write the hashValue always before the valueList of a row (as hashValue is lexically smaller than valueList). But how can I do this? And why does the example work? Another idea: is it somehow possible to split the "hashValue" and "valueList" columns into two different Tez tasks (also not working for me so far, and it wouldn't be a good solution anyway)? Thank you for the help.
08-03-2018
09:47 AM
I found the following article, which helped me: http://www.openkb.info/2017/05/hive-on-tez-how-to-control-number-of.html I now set the number of tasks to a fixed amount.