Member since: 01-14-2019
Posts: 144
Kudos Received: 48
Solutions: 17
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1305 | 10-05-2018 01:28 PM
 | 1123 | 07-23-2018 12:16 PM
 | 1461 | 07-23-2018 12:13 PM
 | 7306 | 06-25-2018 03:01 PM
 | 4944 | 06-20-2018 12:15 PM
06-18-2018
01:42 PM
1 Kudo
You may not be accounting for the driver RAM. Spark creates a driver process that acts as the "parent" from which the executor processes are spawned as separate YARN containers. You are specifying 2GB of executor memory, but you did not specify the driver's memory limit; by default the driver is allocated 1GB of RAM, which explains your calculations. https://spark.apache.org/docs/latest/configuration.html
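For example, a spark-submit invocation that sizes the driver explicitly might look like this (a minimal sketch; the memory values, executor count, and jar name are placeholders):
# Size the driver explicitly; without --driver-memory it defaults to 1g,
# which is the extra gigabyte showing up in the memory math.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2g \
  --executor-memory 2g \
  --num-executors 4 \
  my-app.jar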
06-12-2018
01:59 PM
According to the relevant Phoenix JIRA ticket, this is still in the Unresolved state, so you will have to use the HBase APIs directly. https://issues.apache.org/jira/browse/PHOENIX-590
05-23-2018
01:49 PM
1 Kudo
@Bhanu Pamu Please refer to the following section of our documentation to enable HDFS storage for Zeppelin HA: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_zeppelin-component-guide/content/ch_zeppelin_upgrade_hdfs_storage.htm
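In essence, that procedure points Zeppelin's notebook storage at HDFS so both HA instances share the same notebooks. As a rough sketch of the resulting settings (shown as key = value; the class and path are assumptions for the Zeppelin 0.7.x line, so follow the linked doc for the exact steps):
# zeppelin-site settings (edited via Ambari > Zeppelin > Configs) -- illustrative values
zeppelin.notebook.storage = org.apache.zeppelin.notebook.repo.FileSystemNotebookRepo
zeppelin.notebook.dir = /user/zeppelin/notebook    # HDFS path visible to both instances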
05-16-2018
12:06 PM
I couldn't find any documentation on this specific calculation, but you can understand it through testing as you have already. If you'd like to verify, insert a 2GB file into HDFS and get measurements before and after the insert. You should see the numbers change by the respective amounts.
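For example, a quick before/after check could look like this (a sketch; the file name and target path are placeholders):
# Snapshot the numbers before the insert
hdfs dfsadmin -report | grep "DFS Used"
hdfs dfs -du -s -h /
# Load a ~2GB test file into HDFS
hdfs dfs -put big-2gb-file.bin /tmp/
# Snapshot again and compare the deltas
hdfs dfsadmin -report | grep "DFS Used"
hdfs dfs -du -s -h /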
05-16-2018
11:56 AM
You actually should not have to 'source Profile' before running cbd start. If you run cbd start in the same directory as Profile, then it will pull in those variables for you without having to 'source' them from the file.
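For example (the deployment directory path is an assumption; use whichever directory holds your Profile):
# cbd reads Profile from the current working directory,
# so run it from the directory that contains the file.
cd /var/lib/cloudbreak-deployment
cbd start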
05-15-2018
04:46 PM
@Vinuraj M This is referring to the replication factor of HDFS, which defaults to 3. This means that files you place on HDFS are stored 3 times on disks across the cluster for redundancy and node-failure tolerance. Therefore your 'du -h' will give you the sum of the file sizes you have placed on HDFS, whereas the HDFS disk usage will give you the total disk space consumed. 6.XX GB * 3 replication factor = ~19 GB
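To see the two numbers side by side (a sketch):
# Logical size of the files you placed on HDFS (a single copy)
hdfs dfs -du -s -h /
# Raw disk consumed across the cluster (all replicas),
# roughly 3x the logical size at the default replication factor
hdfs dfsadmin -report | grep "DFS Used"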
05-15-2018
04:34 PM
@Pankaj Singh Complete instructions for setting up Cloudbreak on Google Cloud can be found here: https://docs.hortonworks.com/HDPDocuments/Cloudbreak/Cloudbreak-2.4.0/content/gcp-launch/index.html
UAA_DEFAULT_USER_PW needs to be set in a file named 'Profile', from which you initialize Cloudbreak.
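For example, the relevant line in Profile looks something like this (the value is a placeholder):
# Profile -- picked up by cbd when run from this directory
export UAA_DEFAULT_USER_PW=MySecurePassword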
05-11-2018
12:31 PM
1 Kudo
@Jorge Florencio We currently use Hive 2 for LLAP functionality and Hive 1 for Hive on Tez. If you enable "Interactive Query" in the Configs tab, you'll see the following output when you connect via Beeline. Let me know if there is a particular reason you want to use Hive 2.
Connected to: Apache Hive (version 2.1.0.2.6.4.0-91)
Driver: Hive JDBC (version 1.2.1000.2.6.4.0-91)
Transaction isolation: TRANSACTION_REPEATABLE_READ
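For reference, connecting to the interactive (LLAP) HiveServer2 with Beeline looks something like this (the host and port are assumptions; copy the exact JDBC URL from the Hive summary page in Ambari):
beeline -u "jdbc:hive2://<llap-host>:10500/default" -n hive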
05-10-2018
02:44 PM
2 Kudos
This article is based on the following Kaggle competition:
https://www.kaggle.com/arjunjoshua/predicting-fraud-in-financial-payment-services
It is a Scala-based implementation of the data science exploration, which was originally written in Python. In addition to training a model, we can also batch-evaluate a set of data stored in a file against the trained model.
Full configuration, build, and installation instructions can be found at the GitHub repo:
https://github.com/anarasimham/anomaly-detection
When you execute the model training, you'll get various lines of output as the data is cleaned and the model is built. To view this output, use the tracking URL provided in the Spark job console output, which will look like the following:
18/05/10 14:33:58 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: <HOST_IP>
ApplicationMaster RPC port: 0
queue: default
start time: 1525962717635
final status: SUCCEEDED
tracking URL: http://<SPARK_SERVER_HOST_NAME>:8088/proxy/application_1525369563028_0053/
user: root
The last few lines, which show the trained model's predictions on the held-out test data, look like this:
+-----+-------------------------------------------------------------------------------------------------------------+-----------------------------------------+----------+
|label|features |probabilities |prediction|
+-----+-------------------------------------------------------------------------------------------------------------+-----------------------------------------+----------+
|0 |(10,[0,2,5,8,9],[1.0,1950.77,106511.31,-1950.77,104560.54]) |[0.9777384996414185,0.02226151153445244] |0.0 |
|0 |(10,[0,2,5,8,9],[1.0,3942.44,25716.56,-3942.44,21774.120000000003]) |[0.9777384996414185,0.02226151153445244] |0.0 |
|0 |[1.0,0.0,7276.69,93.0,0.0,1463.0,0.0,0.0,-7183.69,-5813.69] |[0.9777384996414185,0.02226151153445244] |0.0 |
|0 |(10,[0,2,5,8,9],[1.0,13614.91,30195.0,-13614.91,16580.09]) |[0.9777384996414185,0.02226151153445244] |0.0 |
|0 |[1.0,0.0,17488.56,14180.0,0.0,182385.22,199873.79,0.0,-3308.5600000000013,-34977.130000000005] |[0.9777384996414185,0.02226151153445244] |0.0 |
|0 |[1.0,0.0,19772.53,0.0,0.0,44486.99,64259.52,0.0,-19772.53,-39545.06] |[0.9777384996414185,0.02226151153445244] |0.0 |
|1 |(10,[0,2,3,7,9],[1.0,20128.0,20128.0,1.0,-20128.0]) |[0.022419333457946777,0.9775806665420532]|1.0 |
|0 |[1.0,0.0,33782.98,0.0,0.0,39134.79,16896.7,0.0,-33782.98,-11544.890000000003] |[0.9777384996414185,0.02226151153445244] |0.0 |
|0 |[1.0,0.0,34115.82,32043.0,0.0,245.56,34361.39,0.0,-2072.8199999999997,-68231.65] |[0.9777384996414185,0.02226151153445244] |0.0 |
The original data is split into training data and test data, and the table above shows the results of running the test data through the model.
The "label" column denotes which label (0=legitimate, 1=fraudulent) the row of data truly falls into
The "features" column is all the data that went into training the model, in vectorized format because that is the way the model understands the data
The "probabilities" column denotes how likely the model thinks each of the labels is (first number being 0, second number being 1), and the "prediction" column is what the model thinks the data falls into. You can add additional print statements and re-run the training to explore When you execute the evaluation portion of this project (instructions in the GitHub repo), you will re-load the model from disk and use test data from a file to see if the model is predicting correctly. Note that it is a bad practice to use test data from the training set (like I have) but for simplicity I have done that. You can go to the Spark UI as above to view the output. And there you have it, a straightforward approach to building a Gradient Boosted Decision Tree Machine Learning model based off of financial data. This approach can be applied not only to Finance but can be used to train a whole variety of use cases in other industries.
03-27-2018
06:18 PM
Could you share more details? It sounds like you are not running as a privileged user if you're getting an 'operation not permitted' error, but it is hard to know without context. Are any of your other services starting? Have you ever been able to start the Spark2 History Server, or is this a recent issue? What does your cluster look like (number of nodes and specs for each node)? Are there any other issues you've noticed, or other details you can provide?