Member since 07-25-2016 · 28 Posts · 74 Kudos Received · 0 Solutions
09-29-2017 08:33 AM · 2 Kudos
PROBLEM: The Hive metastore statistics for a table are not updated after rows are inserted into it. The symptom is an incorrect 'numRows' value in the output of DESCRIBE FORMATTED <tb_name>. RESOLUTION: Hive stats are auto-gathered correctly until an 'ANALYZE TABLE <tablename> COMPUTE STATISTICS FOR COLUMNS' is run; after that, the stats are no longer auto-updated until the command is run again. This is due to a known issue (https://issues.apache.org/jira/browse/HIVE-12661) and has been fixed in HDP 2.5.0.
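A minimal sketch of the check and the manual refresh, using a hypothetical table sample_tbl (table name and columns are illustrative only):

-- Auto-gathered stats keep numRows current on insert...
CREATE TABLE sample_tbl (id INT, name STRING);
INSERT INTO TABLE sample_tbl VALUES (1, 'a'), (2, 'b');
DESCRIBE FORMATTED sample_tbl;   -- numRows appears under Table Parameters

-- ...but once column statistics have been computed, numRows is no longer
-- auto-updated after new inserts (HIVE-12661); rerun this to refresh it:
ANALYZE TABLE sample_tbl COMPUTE STATISTICS FOR COLUMNS;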
09-29-2017 08:18 AM · 2 Kudos
PROBLEM: While executing a simple PySpark script that selects data from a Hive transactional table stored in ORC format, the customer faces the following exception:
java.lang.RuntimeException: serious problem
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:311)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2378)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2780)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2377)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2384)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2120)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2119)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2810)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2119)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2334)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0000045_0000"
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
... 50 more
Caused by: java.lang.NumberFormatException: For input string: "0000045_0000"
ROOT CAUSE: Reading Hive ACID (transactional) tables from Spark is unsupported technology. Here is a quick link to the Apache JIRA: https://issues.apache.org/jira/browse/SPARK-15348
RESOLUTION: Currently this can be worked around by using Hive LLAP through the spark-llap connector. The feature is, however, still in Technical Preview and has not been made GA. There is no roadmap available for this issue yet from Hortonworks.
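For context, the tables that trigger this are Hive ACID (transactional) ORC tables. A minimal, hypothetical definition of such a table is shown below; reading a table like this directly from Spark SQL, as the PySpark script above does, hits the unsupported ACID case and fails with the exception shown:

-- Hypothetical Hive ACID table for illustration; not part of the original post
CREATE TABLE acid_tbl (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');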
04-02-2018 03:31 AM
Oracle LONG columns are a nasty business: they work with practically no other SQL or PL/SQL data types. Depending on the actual version of the DBMS, VARCHAR2 values can get treated as LONG and cause errors like this.
09-29-2017 08:01 AM · 2 Kudos
This is an unsupported scenario and an area that has not been explored yet. There is no real modification-time concept in object stores; they only have a creation time, which is the time observed at the far end. If you upload a file to a store in a remote timezone, you may get that timezone's time back. The underlying issue here is not a bug. distcp -update relies on comparing file checksums with HDFS, and not all stores export their checksum through the Hadoop API (WASB does, S3A does not yet). In addition, because the checksum algorithms used by blob stores and HDFS differ, a checksum difference cannot be used as a cue that a file has changed. Note that this also occurs when copying between HDFS encryption zones, as the checksums of the encrypted files will differ.
06-30-2017 09:30 PM · 6 Kudos
SYMPTOM: A Hive query with a GROUP BY clause is stuck in the reducer phase for a very long time on a large amount of data. ROOT CAUSE: This happens when the GROUP BY is not optimized. By default Hive sends the rows with the same group-by keys to the same reducer. If the distinct values of the group-by columns are skewed, one reducer may receive most of the shuffled data and stay stuck for a very long time. WORKAROUND: In this case increasing the Tez container memory will not help. The data skew can be avoided by setting the following properties before running the query (see the sketch below):
set hive.tez.auto.reducer.parallelism=true;
set hive.groupby.skewindata=true;
set hive.optimize.skewjoin=true;
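A minimal usage sketch, assuming a hypothetical table sales with a skewed grouping column region:

set hive.tez.auto.reducer.parallelism=true;
set hive.groupby.skewindata=true;
set hive.optimize.skewjoin=true;

-- With hive.groupby.skewindata=true the aggregation runs in two stages:
-- the first stage spreads the skewed keys randomly across reducers for
-- partial aggregation, and the second stage produces the final result,
-- so no single reducer has to process most of the shuffled data.
SELECT region, COUNT(*) AS cnt
FROM sales
GROUP BY region;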
06-30-2017 03:12 PM · 6 Kudos
SYMPTOM: A SELECT statement on a view fails when the columns are projected in a different order.
FAILING QUERIES:
select id, dept, emp, fname from testview order by id, dept;
select id, emp, dept, fname from testview order by id, dept;
select emp, dept, id, fname from testview order by id, dept;
SUCCESSFUL QUERIES:
select emp, fname, id, dept from testview order by id, dept;
select emp, citystate, fname, dept from testview order by id, dept;
select emp, fname, dept, id from testview order by id, dept;
EXCEPTION:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating VALUE._col1
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:86)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:343)
... 17 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.hadoop.io.Text.set(Text.java:225)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryHiveVarchar.init(LazyBinaryHiveVarchar.java:47)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryStruct.uncheckedGetField(LazyBinaryStruct.java:267)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryStruct.getField(LazyBinaryStruct.java:204)
at org.apache.hadoop.hive.serde2.lazybinary.objectinspector.LazyBinaryStructObjectInspector.getStructFieldData(LazyBinaryStructObjectInspector.java:64)
at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator._evaluate(ExprNodeColumnEvaluator.java:98)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:81)
... 18 more
2017-05-30 20:12:32,035 [INFO] [TezChild] |exec.FileSinkOperator|: FS[1]: records written - 0
2017-05-30 20:12:32,035 [INFO] [TezChild] |exec.FileSinkOperator|: RECORDS_OUT_0:0,
ROOT CAUSE: The exception is due to a mismatch between serialization and deserialization for a Hive table backed by the SequenceFile input format. The serialization by LazyBinarySerDe in the previous MapReduce job used a different order of columns. When the current MapReduce job deserializes the intermediate sequence file generated by the previous job, LazyBinaryStruct reads corrupted data because it applies the wrong column order. The mismatch between serialization and deserialization is caused by the SelectOperator's column pruning (ColumnPrunerSelectProc). WORKAROUND:
1] Create an ORC table from the sequence table as follows:
create table test_orc stored as orc as select * from testtable;
2] Recreate the view on top of the new ORC table (see the sketch below).
REFERENCE: https://issues.apache.org/jira/browse/HIVE-14564
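A minimal sketch of the workaround end to end; the view definition below is hypothetical and should be replaced with the original definition of testview:

-- 1] Copy the sequence-file table into an ORC table
create table test_orc stored as orc as select * from testtable;

-- 2] Recreate the view on top of the ORC table (illustrative column list)
drop view if exists testview;
create view testview as
select id, dept, emp, fname, citystate from test_orc;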
06-30-2017 07:40 AM · 8 Kudos
SYMPTOM: An external table created over a directory that contains subdirectories does not read the data files placed in those subdirectories. The table was created as follows:
CREATE EXTERNAL TABLE test(
id STRING,
dept STRING)
row format delimited
fields terminated by ','
location '/user/hdfs/testdata/';
ROOT CAUSE: The files under the location provided while creating the table are organized into subdirectories, as follows:
/user/hdfs/testdata/1/test1
/user/hdfs/testdata/2/test2
/user/hdfs/testdata/3/test3
/user/hdfs/testdata/4/test4
RESOLUTION: To make the subdirectories accessible, set the following two properties before executing the CREATE TABLE statement (a full sequence is sketched below):
set mapred.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
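Putting it together, a minimal sketch of the working sequence (paths and table as in the example above):

set mapred.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;

CREATE EXTERNAL TABLE test(
id STRING,
dept STRING)
row format delimited
fields terminated by ','
location '/user/hdfs/testdata/';

-- Rows from the nested files (e.g. /user/hdfs/testdata/1/test1) are now read
SELECT * FROM test LIMIT 10;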
06-25-2017 02:26 AM · 8 Kudos
SYMPTOM:
=> This problem occurs for a partitioned table without any null partitions that contains approximately 600 or more columns.
=> The following stack trace is observed in the Hive metastore logs:
Nested Throwables StackTrace:
org.datanucleus.store.rdbms.exceptions.MappedDatastoreException: INSERT INTO "PARTITION_PARAMS" ("PARAM_VALUE","PART_ID","PARAM_KEY") VALUES (?,?,?)
at org.datanucleus.store.rdbms.scostore.JoinMapStore.internalPut(JoinMapStore.java:1056)
at org.datanucleus.store.rdbms.scostore.JoinMapStore.put(JoinMapStore.java:307)
at org.datanucleus.store.types.wrappers.backed.Map.put(Map.java:653)
at org.apache.hadoop.hive.common.StatsSetupConst.setColumnStatsState(StatsSetupConst.java:285)
at org.apache.hadoop.hive.metastore.ObjectStore.updatePartitionColumnStatistics(ObjectStore.java:6237)
at sun.reflect.GeneratedMethodAccessor118.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:103)
at com.sun.proxy.$Proxy10.updatePartitionColumnStatistics(Unknown Source)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.updatePartitonColStats(HiveMetaStore.java:4596)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.set_aggr_stats_for(HiveMetaStore.java:5953)
at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:139)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:97)
at com.sun.proxy.$Proxy12.set_aggr_stats_for(Unknown Source)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$set_aggr_stats_for.getResult(ThriftHiveMetastore.java:11062)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$set_aggr_stats_for.getResult(ThriftHiveMetastore.java:11046)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:110)
at org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:106)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:118)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.postgresql.util.PSQLException: ERROR: value too long for type character varying(4000)
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2157)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1886)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:555)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:417)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:363)
at com.jolbox.bonecp.PreparedStatementHandle.executeUpdate(PreparedStatementHandle.java:205)
at org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeUpdate(ParamLoggingPreparedStatement.java:393)
at org.datanucleus.store.rdbms.SQLController.executeStatementUpdate(SQLController.java:431)
at org.datanucleus.store.rdbms.scostore.JoinMapStore.internalPut(JoinMapStore.java:1047)
... 30 more
ROOT CAUSE:
=> The ANALYZE TABLE query updates the statistics in the metastore database.
=> The metastore database limits the PARTITION_PARAMS.PARAM_VALUE column to 4000 characters.
=> Hence, a table with too many columns produces a statistics value that exceeds this limit and the update fails.
WORKAROUND: Increase the width of the PARTITION_PARAMS.PARAM_VALUE column in the metastore database.
STEPS:
1] Stop the metastore/HS2.
2] Back up the metastore database.
3] Increase the column width to a reasonable value. For a Postgres metastore database, use the following command:
ALTER TABLE PARTITION_PARAMS ALTER COLUMN PARAM_VALUE TYPE varchar(64000);
4] Start the metastore/HS2 again.
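If needed, the effective column width can be checked before and after the change; a minimal sketch against a Postgres metastore (assuming the metastore stores the table under its upper-case name, as the quoted identifiers in the stack trace above suggest):

SELECT table_name, column_name, character_maximum_length
FROM information_schema.columns
WHERE table_name = 'PARTITION_PARAMS'
  AND column_name = 'PARAM_VALUE';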
06-24-2017 10:06 PM · 7 Kudos
SYMPTOM: Incorrect status is shown for DAGs in the Tez UI. ROOT CAUSE: This is a known issue (https://issues.apache.org/jira/browse/TEZ-3656). It only happens for killed applications or when there was a failure to write into the Application Timeline Server. It should not cause any issues other than the wrong status for the DAG in the Tez UI. RESOLUTION: This is fixed in the HDP 2.6.1 release.
06-24-2017 09:45 PM · 7 Kudos
PROBLEM DEFINITION: CREATE TABLE DT(Dérivation string, Pièce_Générique string); throws a ParseException. ROOT CAUSE / WORKAROUND: Hive database names, table names, and column names cannot contain Unicode characters; Hive supports UTF-8 and Unicode strings only in table data and comments. LINKS: https://cwiki.apache.org/confluence/display/Hive/User+FAQ
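A minimal sketch of a workaround under this limitation: keep the identifiers in ASCII and move the Unicode text into column comments and data (the names below are illustrative):

-- ASCII identifiers; the Unicode labels live in comments instead
CREATE TABLE DT (
derivation STRING COMMENT 'Dérivation',
piece_generique STRING COMMENT 'Pièce Générique'
);

-- UTF-8 data itself is supported
INSERT INTO TABLE DT VALUES ('dérivée', 'pièce générique');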