Member since: 02-16-2022 | Posts: 9 | Kudos Received: 2 | Solutions: 0
12-22-2023
08:47 AM
Hello, it's not clear to me how users are managed in CDP Public Cloud / Machine Learning / Cloudera Data Visualization. I can create a new user from the DataViz menu, but in that case, how is it linked to the users that I manage from Management Console / User Management in the Control Plane? In DataViz Users & Groups, do I have to create a new user with the same workload username and password assigned to the user in the User Management section? Are these users synchronized somehow? I'm asking because once a user has been created in DataViz, I am not able to change its password, so I'm wondering if they are somehow synchronized and can be managed from the User Management section in Management Console. Thank you, Andrea
11-02-2023
04:01 PM
@DianaTorres, yes, thank you! @aakulov, thank you for your answer! Andrea
10-27-2023
03:55 AM
Hello, due to a lack of memory on a DE cluster in DataHub while executing Spark jobs, we are evaluating resizing the worker/compute nodes to a more performant EC2 instance type (no need to add more nodes, just to increase the available memory). At the moment we have 3 worker nodes and 4 compute nodes, running on m5.4xlarge instances. Is it better to resize both node types, or can we resize, for example, only the compute nodes? Thank you, Andrea
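For context, a quick back-of-the-envelope comparison of total cluster memory under the options being weighed. This is a minimal sketch assuming m5.4xlarge provides 64 GiB per node and taking m5.8xlarge (128 GiB) as a hypothetical upgrade target; the instance choice is illustrative, not a recommendation:

# Rough total-memory comparison for the DataHub resize question.
# Assumption: m5.4xlarge = 64 GiB, m5.8xlarge = 128 GiB (hypothetical target);
# YARN-usable memory per node will be lower after OS/daemon overhead.
GIB = {"m5.4xlarge": 64, "m5.8xlarge": 128}

def total_gib(workers, worker_type, computes, compute_type):
    return workers * GIB[worker_type] + computes * GIB[compute_type]

print(total_gib(3, "m5.4xlarge", 4, "m5.4xlarge"))  # current: 448 GiB
print(total_gib(3, "m5.4xlarge", 4, "m5.8xlarge"))  # compute nodes only: 704 GiB
print(total_gib(3, "m5.8xlarge", 4, "m5.8xlarge"))  # both node types: 896 GiB

If the Spark executors run only on the compute host group, resizing those nodes alone already adds 256 GiB; whether the worker nodes need the same bump depends on which roles they host.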
Labels:
- Cloudera Data Platform (CDP)
08-31-2023
02:11 AM
Hello @smruti, unfortunately I have limited access to the cluster (no CM access) and am not able to create a support case at the moment. I suppose the log analysis should be performed on the HS2 log only, correct? Thank you, Andrea
08-29-2023
12:34 AM
1 Kudo
Hello @VidyaSargur, the solution provided by @smruti hides the warning message, and I'm fine with that, but it did not solve the issue that generates the message. Thank you, Andrea
08-17-2023
03:29 AM
Thank you @smruti. Yes, it happens each time I run the spark-submit. I will run a test with the log level set to ERROR and keep looking for a solution. Thank you, Andrea
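For reference, a minimal sketch of two ways the log level could be lowered from PySpark. The per-logger variant is an assumption based on the log4j 1.x setup shipped with CDP 7.1.x and only silences the Thrift transport logger instead of everything:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("loglevel-test").getOrCreate()

# Option 1: raise the driver-side threshold for all loggers.
spark.sparkContext.setLogLevel("ERROR")

# Option 2 (assumes log4j 1.x on the driver JVM): silence only the logger
# that emits the "Error closing output stream" warning.
log4j = spark.sparkContext._jvm.org.apache.log4j
log4j.LogManager.getLogger("org.apache.thrift.transport").setLevel(log4j.Level.ERROR)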
08-16-2023
02:08 AM
1 Kudo
Hello, we are performing Hive queries with PySpark using the HWC in JDBC_CLUSTER mode. Everything is running fine and we get the results for the queries, but we also receive a warning message saying that the connection has been closed:

23/08/16 09:59:05 WARN conf.HiveConf: HiveConf of name hive.masking.algo does not exist
23/08/16 09:59:05 WARN transport.TIOStreamTransport: Error closing output stream.
java.net.SocketException: Connection or outbound has closed
    at sun.security.ssl.SSLSocketImpl$AppOutputStream.write(SSLSocketImpl.java:1181)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
    at org.apache.thrift.transport.TIOStreamTransport.close(TIOStreamTransport.java:110)
    at org.apache.thrift.transport.TSocket.close(TSocket.java:235)
    at org.apache.thrift.transport.TSaslTransport.close(TSaslTransport.java:400)
    at org.apache.thrift.transport.TSaslClientTransport.close(TSaslClientTransport.java:37)
    at org.apache.hadoop.hive.metastore.security.TFilterTransport.close(TFilterTransport.java:52)
    at org.apache.hive.jdbc.HiveConnection.close(HiveConnection.java:1153)
    at org.apache.commons.dbcp2.DelegatingConnection.closeInternal(DelegatingConnection.java:239)
    at org.apache.commons.dbcp2.PoolableConnection.reallyClose(PoolableConnection.java:232)
    at org.apache.commons.dbcp2.PoolableConnectionFactory.destroyObject(PoolableConnectionFactory.java:367)
    at org.apache.commons.pool2.impl.GenericObjectPool.destroy(GenericObjectPool.java:921)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:468)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:365)
    at org.apache.commons.dbcp2.PoolingDataSource.getConnection(PoolingDataSource.java:134)
    at org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:1563)
    at com.hortonworks.spark.sql.hive.llap.JDBCWrapper.getConnector(HS2JDBCWrapper.scala:481)
    at com.hortonworks.spark.sql.hive.llap.DefaultJDBCWrapper.getConnector(HS2JDBCWrapper.scala)
    at com.hortonworks.spark.sql.hive.llap.util.QueryExecutionUtil.getConnection(QueryExecutionUtil.java:96)
    at com.hortonworks.spark.sql.hive.llap.JdbcDataSourceReader.getTableSchema(JdbcDataSourceReader.java:116)
    at com.hortonworks.spark.sql.hive.llap.JdbcDataSourceReader.readSchema(JdbcDataSourceReader.java:128)
    at com.hortonworks.spark.sql.hive.llap.JdbcDataSourceReader.<init>(JdbcDataSourceReader.java:72)
    at com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector.getDataSourceReader(HiveWarehouseConnector.java:72)
    at com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector.createReader(HiveWarehouseConnector.java:40)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$SourceHelpers.createReader(DataSourceV2Relation.scala:161)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:178)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:224)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:187)
    at com.hortonworks.spark.sql.hive.llap.HiveWarehouseSessionImpl.executeJdbcInternal(HiveWarehouseSessionImpl.java:295)
    at com.hortonworks.spark.sql.hive.llap.HiveWarehouseSessionImpl.sql(HiveWarehouseSessionImpl.java:159)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

Once thrown, execution continues and ends with no errors or missing data. The spark-submit command is the following:

spark-submit --master yarn --driver-memory 1g --queue <queue_name> \
  --conf spark.pyspark.python=/opt/venv/pdr/bin/python3.6 \
  --conf spark.pyspark.driver.python=/opt/venv/pdr/bin/python3.6 \
  --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-1.0.0.7.1.7.1000-141.jar \
  --py-files /opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/pyspark_hwc-1.0.0.7.1.7.1000-141.zip \
  /home/<path_to_Python_script>/script.py

Configuration settings inside the Python script (script.py) are the following:

from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.enableHiveSupport() \
.appName(appname) \
.config("spark.yarn.queue","<queue_name>") \
.config("spark.datasource.hive.warehouse.read.via.llap","false") \
.config("spark.sql.hive.hiveserver2.jdbc.url.principal","hive/_HOST@<domain>") \
.config("spark.datasource.hive.warehouse.read.mode","JDBC_CLUSTER") \
.config("spark.sql.extensions","com.hortonworks.spark.sql.rule.Extensions") \
.config("hive.support.concurrency", "true") \
.config("hive.enforce.bucketing","true") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("hive.txn.manager","org.apache.hadoop.hive.ql.lockmgr.DbTxnManager") \
.config("hive.compactor.initiator.on", "true") \
.config("hive.compactor.worker.threads","1") \
.config("hive.tez.container.size", "12288") \
.config("tez.queue.name","<queue_name>") \
.config("mapred.job.queuename","<queue_name>") \
.config("spark.executor.core",3) \
.config("spark.executor.memory","6g") \
.config("spark.shuffle.service.enabled","true") \
.config("spark.dynamicAllocation.enabled","true") \
.config("spark.dynamicAllocation.minExecutors",0) \
.config("spark.dynamicAllocation.initialExecutors",1) \
.config("spark.dynamicAllocation.maxExecutors",20) \
.config("spark.kryo.registrator","com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator") \
.config("spark.kryoserializer.buffer.max", "128m") \
.config("spark.sql.autoBroadcastJoinThreshold", -1) \
.config("spark.sql.hive.hiveserver2.jdbc.url","jdbc:hive2://<hive2_jdbc_URL>:10000/default;tez.queue.name=<queue_name>;ssl=true") As said, script is correctly executed and results are returned. Changing driver-memory and/or spark.executor.core / spark.executor.memory does not change the fact that the warning is still thrown. Any idea? Thank you, Andrea
Labels:
- Apache Hive
06-02-2022
02:51 PM
Hi JM, yes it's a kerberized environment. KRs, Andrea
06-02-2022
09:41 AM
Hello, due to a problem with a script, we have almost saturated the available HDFS space. I suppose this is caused by temporary Hive files that were not cleaned up due to the abnormal termination of the script. I would like to check the /tmp/hive folder on HDFS, but my user, which has administrative privileges, cannot access the /tmp/hive folder. Is there a way to check and clean up this folder? Any help would be appreciated. KRs, Andrea
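For reference, a minimal sketch of the kind of check and cleanup involved, run as a user with HDFS superuser rights (for example after a kinit as the hdfs principal on a kerberized cluster); the paths below are illustrative:

import subprocess

# Show space used per scratch dir under /tmp/hive (needs hdfs/superuser rights).
subprocess.run(["hdfs", "dfs", "-du", "-h", "/tmp/hive"], check=True)

# List the per-user scratch dirs to spot stale ones left by aborted jobs.
subprocess.run(["hdfs", "dfs", "-ls", "/tmp/hive"], check=True)

# Once a dir is confirmed stale, remove it; -skipTrash frees the space
# immediately instead of moving the data to the trash directory.
# subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", "/tmp/hive/<stale_dir>"], check=True)

Hive also ships a cleardanglingscratchdir service (hive --service cleardanglingscratchdir) aimed at exactly these leftovers, which may be safer than deleting directories by hand.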
Labels: