Member since: 02-16-2022 | Posts: 9 | Kudos Received: 2 | Solutions: 0
12-22-2023
08:47 AM
Hello, it's not clear to me how users are managed in CDP Public Cloud / Machine Learning / Cloudera Data Visualization. I can create a new user from the DataViz menu, but in that case, how is it linked to the users that I manage from Management Console / User Management in the Control Plane? In DataViz Users & Groups, do I have to create a new user with the same workload username and password assigned to the user in the User Management section? Are these users synchronized somehow? I'm asking because once a user has been created in DataViz, I am not able to change its password, so I'm wondering if they are somehow synchronized and can be managed from the User Management section in Management Console. Thank you, Andrea
11-02-2023
04:01 PM
@DianaTorres, yes, thank you! @aakulov, thank you for your answer! Andrea
10-27-2023
03:55 AM
Hello, due to a lack of memory on a DE cluster in DataHub while executing Spark jobs, we are evaluating resizing the worker/compute nodes to a more performant EC2 instance type (no need to add more nodes, just to increase the available memory). At the moment we have 3 worker nodes and 4 compute nodes, running on m5.4xlarge instances. Is it better to resize both node types, or can we resize, for example, only the compute nodes? Thank you, Andrea
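For context, a quick back-of-the-envelope comparison of total cluster memory under the options being weighed. This is a minimal sketch assuming m5.4xlarge provides 64 GiB per node and taking m5.8xlarge (128 GiB) as a hypothetical upgrade target; the instance choice is illustrative, not a recommendation:

# Rough total-memory comparison for the DataHub resize question.
# Assumption: m5.4xlarge = 64 GiB, m5.8xlarge = 128 GiB (hypothetical target);
# YARN-usable memory per node will be lower after OS/daemon overhead.
GIB = {"m5.4xlarge": 64, "m5.8xlarge": 128}

def total_gib(workers, worker_type, computes, compute_type):
    return workers * GIB[worker_type] + computes * GIB[compute_type]

print(total_gib(3, "m5.4xlarge", 4, "m5.4xlarge"))  # current: 448 GiB
print(total_gib(3, "m5.4xlarge", 4, "m5.8xlarge"))  # compute nodes only: 704 GiB
print(total_gib(3, "m5.8xlarge", 4, "m5.8xlarge"))  # both node types: 896 GiB

If the Spark executors run only on the compute host group, resizing those nodes alone already adds 256 GiB; whether the worker nodes need the same bump depends on which roles they host.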
Labels:
- Cloudera Data Platform (CDP)
08-31-2023
02:11 AM
Hello @smruti, unfortunately I have limited access to the cluster (no CM access) and am not able to create a support case at the moment. I suppose the log analysis should be performed on the HS2 log only, correct? Thank you, Andrea
08-29-2023
12:34 AM
1 Kudo
Hello @VidyaSargur, the solution provided by @smruti hides the warning message, and I'm fine with that, but it did not solve the issue that generates the message. Thank you, Andrea
08-17-2023
03:29 AM
Thank you @smruti. Yes, it happens each time I run the spark-submit. I will run a test with the log level set to ERROR and keep looking for a solution. Thank you, Andrea
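For reference, a minimal sketch of two ways the log level could be lowered from PySpark. The per-logger variant is an assumption based on the log4j 1.x setup shipped with CDP 7.1.x and only silences the Thrift transport logger instead of everything:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("loglevel-test").getOrCreate()

# Option 1: raise the driver-side threshold for all loggers.
spark.sparkContext.setLogLevel("ERROR")

# Option 2 (assumes log4j 1.x on the driver JVM): silence only the logger
# that emits the "Error closing output stream" warning.
log4j = spark.sparkContext._jvm.org.apache.log4j
log4j.LogManager.getLogger("org.apache.thrift.transport").setLevel(log4j.Level.ERROR)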
08-16-2023
02:08 AM
1 Kudo
Hello, we are performing Hive queries with PySpark using the HWC in JDBC_CLUSTER mode. Everything is running fine and we get the results for the queries, but we also receive a warning message saying that the connection has been closed:

23/08/16 09:59:05 WARN conf.HiveConf: HiveConf of name hive.masking.algo does not exist
23/08/16 09:59:05 WARN transport.TIOStreamTransport: Error closing output stream.
java.net.SocketException: Connection or outbound has closed
    at sun.security.ssl.SSLSocketImpl$AppOutputStream.write(SSLSocketImpl.java:1181)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
    at org.apache.thrift.transport.TIOStreamTransport.close(TIOStreamTransport.java:110)
    at org.apache.thrift.transport.TSocket.close(TSocket.java:235)
    at org.apache.thrift.transport.TSaslTransport.close(TSaslTransport.java:400)
    at org.apache.thrift.transport.TSaslClientTransport.close(TSaslClientTransport.java:37)
    at org.apache.hadoop.hive.metastore.security.TFilterTransport.close(TFilterTransport.java:52)
    at org.apache.hive.jdbc.HiveConnection.close(HiveConnection.java:1153)
    at org.apache.commons.dbcp2.DelegatingConnection.closeInternal(DelegatingConnection.java:239)
    at org.apache.commons.dbcp2.PoolableConnection.reallyClose(PoolableConnection.java:232)
    at org.apache.commons.dbcp2.PoolableConnectionFactory.destroyObject(PoolableConnectionFactory.java:367)
    at org.apache.commons.pool2.impl.GenericObjectPool.destroy(GenericObjectPool.java:921)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:468)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:365)
    at org.apache.commons.dbcp2.PoolingDataSource.getConnection(PoolingDataSource.java:134)
    at org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:1563)
    at com.hortonworks.spark.sql.hive.llap.JDBCWrapper.getConnector(HS2JDBCWrapper.scala:481)
    at com.hortonworks.spark.sql.hive.llap.DefaultJDBCWrapper.getConnector(HS2JDBCWrapper.scala)
    at com.hortonworks.spark.sql.hive.llap.util.QueryExecutionUtil.getConnection(QueryExecutionUtil.java:96)
    at com.hortonworks.spark.sql.hive.llap.JdbcDataSourceReader.getTableSchema(JdbcDataSourceReader.java:116)
    at com.hortonworks.spark.sql.hive.llap.JdbcDataSourceReader.readSchema(JdbcDataSourceReader.java:128)
    at com.hortonworks.spark.sql.hive.llap.JdbcDataSourceReader.<init>(JdbcDataSourceReader.java:72)
    at com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector.getDataSourceReader(HiveWarehouseConnector.java:72)
    at com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector.createReader(HiveWarehouseConnector.java:40)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$SourceHelpers.createReader(DataSourceV2Relation.scala:161)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:178)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:224)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:187)
    at com.hortonworks.spark.sql.hive.llap.HiveWarehouseSessionImpl.executeJdbcInternal(HiveWarehouseSessionImpl.java:295)
    at com.hortonworks.spark.sql.hive.llap.HiveWarehouseSessionImpl.sql(HiveWarehouseSessionImpl.java:159)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

Once thrown, execution continues and ends with no errors or missing data. The spark-submit command is the following:

spark-submit --master yarn --driver-memory 1g --queue <queue_name> \
  --conf spark.pyspark.python=/opt/venv/pdr/bin/python3.6 \
  --conf spark.pyspark.driver.python=/opt/venv/pdr/bin/python3.6 \
  --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-1.0.0.7.1.7.1000-141.jar \
  --py-files /opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/pyspark_hwc-1.0.0.7.1.7.1000-141.zip \
  /home/<path_to_Python_script>/script.py

Configuration settings inside the Python script (script.py) are the following:

from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.enableHiveSupport() \
.appName(appname) \
.config("spark.yarn.queue","<queue_name>") \
.config("spark.datasource.hive.warehouse.read.via.llap","false") \
.config("spark.sql.hive.hiveserver2.jdbc.url.principal","hive/_HOST@<domain>") \
.config("spark.datasource.hive.warehouse.read.mode","JDBC_CLUSTER") \
.config("spark.sql.extensions","com.hortonworks.spark.sql.rule.Extensions") \
.config("hive.support.concurrency", "true") \
.config("hive.enforce.bucketing","true") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("hive.txn.manager","org.apache.hadoop.hive.ql.lockmgr.DbTxnManager") \
.config("hive.compactor.initiator.on", "true") \
.config("hive.compactor.worker.threads","1") \
.config("hive.tez.container.size", "12288") \
.config("tez.queue.name","<queue_name>") \
.config("mapred.job.queuename","<queue_name>") \
.config("spark.executor.core",3) \
.config("spark.executor.memory","6g") \
.config("spark.shuffle.service.enabled","true") \
.config("spark.dynamicAllocation.enabled","true") \
.config("spark.dynamicAllocation.minExecutors",0) \
.config("spark.dynamicAllocation.initialExecutors",1) \
.config("spark.dynamicAllocation.maxExecutors",20) \
.config("spark.kryo.registrator","com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator") \
.config("spark.kryoserializer.buffer.max", "128m") \
.config("spark.sql.autoBroadcastJoinThreshold", -1) \
.config("spark.sql.hive.hiveserver2.jdbc.url","jdbc:hive2://<hive2_jdbc_URL>:10000/default;tez.queue.name=<queue_name>;ssl=true") As said, script is correctly executed and results are returned. Changing driver-memory and/or spark.executor.core / spark.executor.memory does not change the fact that the warning is still thrown. Any idea? Thank you, Andrea
Labels:
- Apache Hive
06-02-2022
02:51 PM
Hi JM, yes it's a kerberized environment. KRs, Andrea
06-02-2022
09:41 AM
Hello, due to a problem with a script, we have almost saturated the available HDFS space. I suppose this is caused by temporary Hive files that were not cleaned up due to the abnormal termination of the script. I would like to check the /tmp/hive folder on HDFS, but my user, which has administrative privileges, cannot access the /tmp/hive folder. Is there a way to check and clean up this folder? Any help would be appreciated. KRs, Andrea
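For reference, a minimal sketch of the kind of check and cleanup involved, run as a user with HDFS superuser rights (for example after a kinit as the hdfs principal on a kerberized cluster); the paths below are illustrative:

import subprocess

# Show space used per scratch dir under /tmp/hive (needs hdfs/superuser rights).
subprocess.run(["hdfs", "dfs", "-du", "-h", "/tmp/hive"], check=True)

# List the per-user scratch dirs to spot stale ones left by aborted jobs.
subprocess.run(["hdfs", "dfs", "-ls", "/tmp/hive"], check=True)

# Once a dir is confirmed stale, remove it; -skipTrash frees the space
# immediately instead of moving the data to the trash directory.
# subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", "/tmp/hive/<stale_dir>"], check=True)

Hive also ships a cleardanglingscratchdir service (hive --service cleardanglingscratchdir) aimed at exactly these leftovers, which may be safer than deleting directories by hand.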
Labels: