Member since: 01-24-2017
Posts: 69
Kudos Received: 2
Solutions: 0
06-15-2019
12:55 AM
@diebestetest wrote: "Hi, could you please share the entire console logs for further analysis? Thanks, Arun"

Sorry, I am not familiar with the topic.
05-31-2019
10:32 AM
It's a problem with permissions; you need to let Spark know about the local dir. The following code then works:

import time
from pyspark.sql import SparkSession


def xmlConvert(spark):
    etl_time = time.time()
    # Read the XML files, one row per HistoricalTextData element
    df = spark.read.format('com.databricks.spark.xml') \
        .options(rowTag='HistoricalTextData') \
        .load('/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/dataset/train/')
    # Pivot the tag values into columns keyed by timestamp, filling gaps with 0
    df = df.withColumn("TimeStamp", df["TimeStamp"].cast("timestamp")) \
        .groupBy("TimeStamp").pivot("TagName").sum("TagValue").na.fill(0)
    # Write the result out as a single CSV file
    df.repartition(1).write.csv(
        path="/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/result/",
        mode="overwrite",
        header=True,
        sep=",")
    print("Time taken to do xml transformation: --- %s seconds ---" % (time.time() - etl_time))


if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName('XML ETL') \
        .master("local[*]") \
        .config('job.local.dir', '/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance') \
        .config('spark.driver.memory', '64g') \
        .config('spark.debug.maxToStringFields', '200') \
        .config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.5.0') \
        .getOrCreate()
    print('Session created')
    try:
        xmlConvert(spark)
    finally:
        spark.stop()
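As a quick sanity check (a sketch only; it reuses the same result path and would need to run before spark.stop(), e.g. inside the try block), the pivoted CSV can be read back and inspected:

    # Read the written CSV back to verify the pivoted schema (inferSchema is just for convenience)
    result = spark.read.csv(
        '/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/result/',
        header=True,
        inferSchema=True)
    result.printSchema()
    result.show(5)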
04-06-2017
09:36 PM
Hi Jordan,
Yes, Cloudera also recommended increasing the heap size, and since I did that a couple of weeks ago I have not seen any more crashes. It is rather surprising, though, that the default configuration causes crashes. That raises the question of how optimal, or even acceptable, the other parameters are and how to tune them.
Thank you,
Igor
03-02-2017
10:26 PM
A lot of other components also have TLS options in their Security section ... Are those mandatory, or only needed for Kerberos?
02-24-2017
05:22 PM
Hi @IgorYakushin,

To add to what @mbigelow mentioned, you can enable Kerberos without using TLS to secure communication between your agents and Cloudera Manager, but that would allow the Kerberos keytabs to be transmitted from Cloudera Manager to your agents in the clear (risking a malicious party gaining access to your keytab).

Most of the security you will likely need is taken care of by enabling TLS for agent communication in this section: Configuring TLS Encryption for Cloudera Manager Agents. This will encrypt communication when the agent gets the keytabs and other files from CM. If you want more security by having the agents verify Cloudera Manager's certificate signer and hostname, then you can configure the trust file for each agent (to trust the CM signer).

In summary, you don't need to have TLS enabled in order to enable Kerberos. If you need to protect the keytabs, enable TLS encryption for agents. If you need higher security by having the agents trust the signer of the Cloudera Manager server certificate, you can proceed with the other steps: https://www.cloudera.com/documentation/enterprise/latest/topics/how_to_configure_cm_tls.html#topic_3

Ben
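For a rough illustration of the agent-side settings this involves (a sketch only; the exact properties and the certificate path are assumptions here, so follow the linked documentation for your CM version), the relevant options live in the agent's /etc/cloudera-scm-agent/config.ini:

    # /etc/cloudera-scm-agent/config.ini -- illustrative excerpt, not from the post
    [Security]
    # Encrypt agent <-> Cloudera Manager traffic
    use_tls=1
    # Optional hardening: have the agent verify the signer of CM's certificate
    # (path below is a hypothetical location for your CA certificate)
    verify_cert_file=/opt/cloudera/security/pki/rootca.pem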
02-23-2017
01:32 PM
Thank you, Ben. That worked. Apparently I was reading an older version of the documentation where Step 3 (letting CM know where the truststore is) is not mentioned. I tried to put it on the same web page where the keystore was, and that did not work. I did not realize there is yet another page to specify the truststore. Igor
02-01-2017
12:59 PM
Hello Igor,

To create a partition in Linux, you'd need to 'fdisk' it first. In your example, (sdb) is the disk, so you'd need to create the partition (sdb1): fdisk /dev/sdb. After that, you'd need to format the new partition as ext4: mkfs.ext4 /dev/sdb1. Make sure you mount it correctly in /etc/fstab, just like I stated in my first response; the 'mount -a' command is a good way to examine your fstab entries.

In regards to the HDFS block size, the block division in HDFS is just logically built on top of the physical blocks of the ext4 filesystem. HDFS blocks are large compared to disk blocks, and the reason for this is to minimize the cost of seeks: if the block is large enough, the time it takes to transfer the data from the disk can be significantly longer than the time to seek to the start of the block.

If there are any additional questions, please let me know.

Thanks,
Laith
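To make the seek-versus-transfer argument concrete, here is a back-of-the-envelope calculation (the seek time and transfer rate are illustrative assumptions, not figures from the post):

    # Illustrative arithmetic for the seek-vs-transfer trade-off (assumed figures)
    seek_time_s = 0.010            # ~10 ms average disk seek (assumption)
    transfer_bytes_per_s = 100e6   # ~100 MB/s sustained transfer rate (assumption)

    for block_mb in (4, 64, 128):
        transfer_s = block_mb * 1e6 / transfer_bytes_per_s
        seek_share = seek_time_s / (seek_time_s + transfer_s)
        print("%4d MB block: seek is %.1f%% of the read time" % (block_mb, seek_share * 100))

With a 128 MB block the seek accounts for well under 1% of the total read time, whereas with a 4 MB block it is around 20%, which is why large HDFS blocks keep sequential reads efficient.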