Member since: 08-02-2019
Posts: 131
Kudos Received: 93
Solutions: 13
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3584 | 12-03-2018 09:33 PM |
| | 4499 | 04-11-2018 02:26 PM |
| | 2463 | 05-09-2017 09:35 PM |
| | 1114 | 03-31-2017 12:59 PM |
| | 2116 | 11-21-2016 08:58 PM |
04-25-2017
05:50 PM
6 Kudos
NOTE: This issue was found in the HDF Sandbox. A new Sandbox will be posted soon. Until then, please use the instructions below.

After importing the HDP 2.6 Sandbox hosted in Oracle VirtualBox and starting the VM, I saw the message "Connectivity issues detected!" in the console, and when I tried to connect to Ambari in the web browser, I was not able to connect. To correct the connectivity:

1. Go to the Oracle VM VirtualBox Manager and select the sandbox virtual machine. Right-click and select Close > Power off to shut down the VM.
2. Right-click on the sandbox virtual machine and select Settings. The VM Settings dialog displays.
3. Click on Network in the list on the left side of the dialog.
4. Click on Advanced to unfold the advanced network settings.
5. Check Cable Connected.
6. Click OK to save the setting.
7. Restart the VM.
8. You should now be able to connect to Ambari by entering the URL http://127.0.0.1:8080 in your browser.
03-31-2017
12:59 PM
Metron supports three types of parsers: Grok, CSV, and Java. For XML data, Java is the best choice. You can see example parsers in the Metron GitHub repository: https://github.com/apache/incubator-metron/tree/master/metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers You could also use NiFi to convert the XML to JSON and enqueue the events to the enrichment topic. Here are some articles about parsing XML logs with NiFi: https://community.hortonworks.com/articles/25720/parsing-xml-logs-with-nifi-part-1-of-3.html
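To make the Java route a little more concrete, here is a rough, hedged sketch of the core of such a parser: turning a raw XML event into the flat JSON message a parser would emit. The event shape and field names are made up for illustration, and the Metron-specific wiring (extending the parser base classes in the repository linked above) is intentionally left out; this only shows the XML-to-JSON step and assumes json-simple is on the classpath. Metron parsers generally attach the raw event and a timestamp to each message, which the sketch mimics.

```java
// Hypothetical sketch: flatten a simple XML event into the JSON that a custom
// Metron parser would emit. Requires json-simple on the classpath; the Metron
// parser wiring itself is omitted.
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.json.simple.JSONObject;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XmlEventToJson {

    @SuppressWarnings("unchecked")
    public static JSONObject toJson(byte[] rawMessage) throws Exception {
        String original = new String(rawMessage, StandardCharsets.UTF_8);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(rawMessage));

        JSONObject message = new JSONObject();
        // Flatten each child element of the root into a JSON field.
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                message.put(node.getNodeName(), node.getTextContent().trim());
            }
        }
        // Metron messages typically carry the raw event and a timestamp.
        message.put("original_string", original);
        message.put("timestamp", System.currentTimeMillis());
        return message;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample event, purely for illustration.
        byte[] sample = "<event><src_ip>10.0.0.1</src_ip><action>deny</action></event>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(toJson(sample).toJSONString());
    }
}
```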
03-31-2017
01:20 AM
5 Kudos
Many of us in Hortonworks Community Connection feel most at home when we are talking about technologies and tools and the "animals in the zoo". However, if we want to grow the data lake and gain support from the business, we have to learn to think a little differently and use a new vocabulary to communicate.

Start by meeting with the business to identify possible use cases. Talk to the analysts about the highest priorities and pain points for the business. Before thinking about anything remotely Hadoop-animal-like, summarize "what" needs to be done. This may take several interviews with different business analysts to gain a full understanding of the problem.

Then determine if Big Data can solve the problem. Are data silos preventing the organization from getting a complete view of the customer or logistics? Is the volume of data required to solve the problem too much or too expensive for existing systems to handle? Are the unstructured or semi-structured data required to solve the problem not working effectively in existing systems? If the answer to any of these questions is yes, then Big Data is likely a good fit.

Next calculate the return of the solution to the business. Return can come from cost savings from increased efficiency or reduction in loss, increased sales resulting from improved customer satisfaction, or new revenue and growth from new data products. Then estimate the investment required for the solution. What are the costs of the development and infrastructure required for the solution? How much will it cost to operationalize the solution? How much will it cost to maintain the solution in coming years?

The value of the solution is the return minus the investment. Project the figures out over several years. In the first year the development, infrastructure, and operationalization costs will most likely be higher, so the value will be lower. However, if the maintenance costs are low, years two and three may have much higher value with lower investment.

Let's look at some example use cases:

1. Customer 360 is bringing everything that the organization knows about the customer into the data lake. The insights gained from Customer 360 can reduce churn, improve customer loyalty, and improve campaign effectiveness. The return is the estimate of increased sales due to reduced churn and better campaign performance. The investment is how much it costs to develop the Customer 360, the costs to obtain the data needed, the infrastructure and personnel required to run the system, and the training required to enable analysts to use it effectively.

2. Fraud detection is preventing loss due to theft. For example, a retailer can flag fraudulent returns of stolen goods or detect theft of merchandise. The return is estimated by measuring the amount of loss that could be prevented, and the investment is the costs to develop the system, the cost of the infrastructure and personnel to run the system, and the costs to deploy the system to stores.

3. Predictive maintenance optimizes downtime and reduces the cost of maintaining machinery in a factory or vehicles in a fleet. Predictive maintenance uses algorithms that look at the historical failure of parts and the operating conditions of the machines and determine what maintenance needs to be done and when. The return of predictive maintenance is calculated from the reductions in downtime or breakdowns and the savings in parts and labor of only doing maintenance when it is indicated by the operating conditions. How much does a breakdown or downtime cost? Will the contents of the vehicle be lost if the vehicle is down for a lengthy period of time? How much is lost in sales when a delivery is not completed? How much is spent on maintenance, and what is the cost of preventable maintenance? The investment is the cost to collect the machinery or vehicle information, the cost to develop the algorithms, and the infrastructure needed to collect and process the machine or fleet data.

Examine the results of the use case discovery and build a roadmap that shows which use cases will be implemented and when the implementation will start and end. Create a map of the use cases on two dimensions: value and difficulty of implementation. Start with the high value use cases that are easy to implement. Save the higher value but more difficult to implement use cases for later in the roadmap; your team will be more experienced and better able to tackle them. Communicate the roadmap to the business in terms of the value and investment required. Don't dive into too many technical details. Keep it high level and focus on the what and the why.

When you start executing on your use cases, don't forget to measure. Tracking your actual return and investment will help you realize the value of the solutions and improve your estimation skills going forward.
03-30-2017
08:13 PM
8 Kudos
Sometimes data in the system expires because it is no longer correct or because the data was rented for a specific time period. One way to implement data expiration requirements is to delete the data after it is no longer valid. However, you may also have another policy that requires retention of the data to track how decisions were made or for compliance with regulations. In addition, deleting the data is more error-prone to implement because an administrator must track a future task to delete the data after it expires. If the task is missed and the data is not deleted, expired or illegal data could lead to incorrect decisions or lapses in compliance. This article shows an example of specifying the expiration date for a Hive table in Atlas and creating a tag based policy that prevents access to the table after the expiration date.

Enabling Atlas in the Sandbox
1. Create a Hortonworks HDP 2.5 Sandbox. You can use either a virtual machine or a host in the cloud.
2. In the browser, enter the Ambari URL (http://<sandbox host name>:8080).
3. Log in with user name raj_ops and password raj_ops.
4. Atlas and its related services are stopped by default in the Sandbox. Follow the instructions in section 1 of the Tag Based Policies Tutorial to start the following services and turn off maintenance mode: Kafka, Ranger Tag Sync, HBase, Ambari Infrastructure, and Atlas. Wait for the operations to complete and all services to show green. Be sure to start Atlas last. For example, if HBase is not running, Atlas will not start properly and will remain red in Ambari after it is started.

Creating a Hive finance Database and tax_2015 Table

5. First we will create a new Hive database and a new table. Then we will apply a Ranger policy to the table that causes it to expire and demonstrate that only specific users can access the table.
6. Click on the grid icon at the top right side of the window near the user name.
7. Select the Hive View menu option. The Hive View, a GUI interface for executing queries, appears.
8. In the Worksheet in the Query Editor, enter the following Hive statements:

CREATE DATABASE finance;
DROP TABLE IF EXISTS finance.tax_2015;
CREATE TABLE finance.tax_2015(ssn string,
fed_tax double,
state_tax double,
local_tax double) STORED AS ORC;
INSERT INTO TABLE finance.tax_2015 VALUES ('123-45-6789',22575,5750,2375);
INSERT INTO TABLE finance.tax_2015 VALUES ('234-56-7890',31114,8765,2346);
INSERT INTO TABLE finance.tax_2015 VALUES ('345-67-8901',35609,10123,3421);
9. Click the Execute button. The Execute button will turn orange with the label Stop Execution and will return to green with the label Execute when the statements complete.

Verifying Both maria_dev and raj_ops Users Can Access the tax_2015 Table

10. Once the Execute button is green, you should see the finance database appear in the Database Explorer on the left side of the screen.
11. Click the finance database. The tax_2015 table will appear.
13. We will now verify that maria_dev can also access the table. In the upper right corner, pull down the menu with the user name (raj_ops).
14. Select Sign Out. The login screen will appear. Log in using user maria_dev and password maria_dev.
15. Select the tile icon and open the Hive View. Repeat the sample query for the tax_2015 table in the finance database. Verify that the query completes and maria_dev has access to the tax_2015 table.

Creating the Tag Service and EXPIRES_ON Tag Based Policy

16. Sign out of Ambari and log in again using user raj_ops and password raj_ops.
17. We will now create a tag based policy in Ranger to deny access to expired data. First we need to add a Tag service.
18. Click Dashboard at the top of the window.
19. Click on Ranger in the list of services.
20. Select Quick Links > Ranger Admin UI.
21. Enter the user name raj_ops with password raj_ops. Pull down the Access Manager menu and select Tag Based Policies.
22. If you don't have a Sandbox_tag service already, select the + button to add a new service.
23. Enter Sandbox_tag in the Service Name field and click Add.
24. We will now associate the new tag service with the resource service for Hive. Even if you already had a Sandbox_tag service, complete the next steps to verify that the Sandbox_tag service is associated with the Sandbox_hive service. If the tag service is not associated, tag based policies will not function properly.
25. Pull down the Access Manager menu and select Resource Based Policies.
26. Click on the pencil button to the right of the Sandbox_hive service. The Edit Service form appears.
27. Select Sandbox_tag from the Select Tag Service drop down.
28. Click Save to save the changes to the Hive service.
29. Pull down the Access Manager menu and select Tag Based Policies.
30. Click on the Sandbox_tag link.
31. An EXPIRES_ON policy is created by default.
32. Click on the Policy ID column for the EXPIRES_ON policy. By default, all users are denied access to data after it expires.
33. We will now add a policy that allows raj_ops to access the expired data. Scroll down to the Deny Conditions and click show to expand the Exclude from Deny Conditions region.
34. Select raj_ops in Select User.
35. Click the + icon in the Policy Conditions column.
36. Enter yes in Accessed after expiry_date. Click the green check icon to save the condition.
37. Click the plus button in the Component Permissions column.
38. Select hive from the components and check hive to permit all Hive operations. Click the green check button to save the Component Permissions.
39. The Deny and Exclude from Deny Conditions should look like the ones below. Everyone except raj_ops is denied access to all expired tables.
40. Click the green Save button at the bottom of the policy to save the policy.

Setting the Expiration Date for the tax_2015 Table by Applying the EXPIRES_ON Tag

41. Return to Ambari. Log in with user name raj_ops and password raj_ops. Click on Dashboard at the top. Then select Atlas from the left. Then select Quick Links > Atlas Dashboard.
42. The Atlas login appears. Enter the user holger_gov and the password holger_gov.
43. Click on the Tags tab on the left side of the screen.
44. Click on Create Tag. The Create a new tag dialog opens.
45. In the Name field, enter EXPIRES_ON. Click the Create button.
46. Click on the Add Attribute+ button for the EXPIRES_ON tag.
47. In the Attribute name field, enter expiry_date. Click the green Add button.
48. Click on the Search tab.
49. Toggle right to select DSL. Select hive_table from the Search For drop down. Click the green Search button.
50. Locate the tax_2015 table. Click on the + in the Tags column. The Add Tag dialog appears.
51. Select EXPIRES_ON from the drop down.
52. Set the expiry_date attribute to 2015/12/31. Then click the green Add button.

Verifying raj_ops Can Access tax_2015 but maria_dev Cannot

53. Return to the Ambari Hive View and log in as raj_ops.
54. Enter the query below in the Query Editor: select * from finance.tax_2015;
55. Click the green Execute button.
56. The query succeeds without error and the results appear at the bottom of the window.
57. Sign out of Ambari and log back in as maria_dev.
58. Enter the same query in the Query Editor: select * from finance.tax_2015;
59. The following error is reported: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: HiveAccessControlException Permission denied: user [maria_dev] does not have [SELECT] privilege on [finance/tax_2015/fed_tax,local_tax,ssn,state_tax]

Inspecting the Audit Log to See the Denial by the EXPIRES_ON Rule

60. To see which policy caused the error, return to Ranger and log in as raj_ops.
61. Click on Audit. Then click on the Access tab.
62. Click in the filter to select SERVICE TYPE Hive and RESULT Denied.
63. If you click on the Policy ID link, you will see that the policy that caused the denial is the EXPIRES_ON policy.

Conclusion

This article shows how to create a tag based policy using Atlas and Ranger that prevents access to data after a specified date. Data expiration policies make it easier to comply with regulations and prevent errors caused by using out-of-date tables.
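As a hedged aside to the verification in steps 53-59: if you would rather check the policy from outside the Hive View, a sketch like the one below runs the same query over the Hive JDBC driver as each user and reports whether Ranger denies access. The host name, port, and passwords are Sandbox-style assumptions, and the hive-jdbc driver (with its dependencies) must be on the classpath.

```java
// Hypothetical check mirroring steps 53-59: query finance.tax_2015 as raj_ops
// and as maria_dev and report whether the EXPIRES_ON tag policy denies access.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ExpiryPolicyCheck {

    static void tryQuery(String user, String password) {
        // Placeholder Sandbox-style host and standard HiveServer2 port.
        String url = "jdbc:hive2://sandbox.hortonworks.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, user, password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM finance.tax_2015")) {
            int rows = 0;
            while (rs.next()) {
                rows++;
            }
            System.out.println(user + ": query succeeded, " + rows + " rows");
        } catch (SQLException e) {
            // For maria_dev this should surface the HiveAccessControlException
            // raised by the EXPIRES_ON tag based policy.
            System.out.println(user + ": access denied - " + e.getMessage());
        }
    }

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        tryQuery("raj_ops", "raj_ops");
        tryQuery("maria_dev", "maria_dev");
    }
}
```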
02-16-2017
08:15 PM
@Eric Hanson I have not used IntelliJ, so I can't advise on options there. However, try building the uber jar, copying it to your edge node with scp, and running it with spark-submit. This will verify that you are building the correct jar. As for local testing, the spark-testing-base library mentioned in Tim's article looks like it will work for unit tests, but at some point you are going to need to run the job on a remote cluster.
02-16-2017
06:47 PM
@Eric Hanson I think the problem is that the Maven POM is not creating an uber jar. When a Spark job runs remotely, all the jars for the job need to be sent to the worker nodes. The ClassNotFoundException occurs because some of those jars are missing. The build succeeds because the build system has the jars, but when the application jar is packaged, it doesn't contain all the dependent jars it needs. This Stack Overflow question has a good description: http://stackoverflow.com/questions/37130178/classnotfoundexception-spark-submit-scala Also see the optional instruction 3 on building an uber jar (one that contains all the dependencies): http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_spark-component-guide/content/using-spark-streaming.html Below instruction 3 are the command lines to use whether or not an uber jar is created. If an uber jar is created, you just specify that jar on the command line. If it is not an uber jar, you need to pass in all the dependent jars as well.
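For reference, here is a minimal sketch of one common way to produce an uber jar, using the Maven Shade plugin. The plugin version is a placeholder, and this assumes Spark and other cluster-provided dependencies are declared with provided scope so they are not bundled into the jar.

```xml
<!-- Hypothetical pom.xml fragment: package the job and its compile-scope
     dependencies into a single shaded ("uber") jar during mvn package. -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.4.3</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

With this in place, mvn package produces a shaded jar containing the project's dependencies, which can be passed directly to spark-submit without listing the dependent jars separately.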
02-15-2017
06:40 PM
@Serge Kazarin Check the port in the Hive connection string and make sure it is 10500, the HiveServer2 Interactive (hive2) endpoint. The standard HiveServer2 endpoint will not understand the LLAP configuration settings.
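To illustrate, here is a minimal hedged sketch of a connection check against the Interactive endpoint. The host name is a placeholder, and the hive-jdbc driver and its dependencies are assumed to be on the classpath.

```java
// Hypothetical connection check against HiveServer2 Interactive (LLAP).
// Note the port: 10500 (Interactive) rather than the standard HiveServer2 port.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LlapPortCheck {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder host; 10500 is the default HiveServer2 Interactive port in HDP.
        String llapUrl = "jdbc:hive2://your-hs2-host:10500/default";
        try (Connection conn = DriverManager.getConnection(llapUrl, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println("Connected to HiveServer2 Interactive: " + rs.getInt(1));
            }
        }
    }
}
```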
02-14-2017
04:36 PM
@Eric Hanson Check out this article by @Timothy Spann https://dzone.com/articles/testing-spark-code
02-14-2017
02:56 PM
@Eric Hanson Can you send your spark-submit command line?
02-14-2017
02:44 PM
@Eric Hanson Do you need the "All Spark Repository"? Can you try removing that repo?