12-29-2016
06:40 PM
6 Kudos
This tutorial is a follow-on to the Apache Spark Fine Grain Security with LLAP Test Drive tutorial. Together, these two articles cover the entire range of security authorization capabilities available for Spark on the Hortonworks Data Platform.

Getting Started

Install an HDP 2.5.3 cluster via Ambari. Make sure the following components are installed: Hive, Spark, Spark Thrift Server, HBase, Ambari Infra, Atlas, and Ranger.

Enable LLAP

Navigate to the Hive configuration page and click Enable Interactive Query. Ambari will ask which host group to put the HiveServer2 Interactive service into; select the host group with the most available resources. With Interactive Query enabled, Ambari will display new configuration options that control resource allocation for the LLAP service. LLAP is a set of long-lived daemons that facilitate interactive query response times and fine grain security for Spark. Since the goal of this tutorial is to test fine grain security for Spark, LLAP only needs a minimal allocation of resources. However, if more resources are available, feel free to increase the allocation and run some Hive queries against the Hive Interactive server to get a feel for how LLAP improves Hive's performance. Save the configuration, confirm, and proceed. Restart all required services, then navigate to the Hive Summary tab and ensure that HiveServer2 Interactive is started.

Download the Spark-LLAP Assembly

From the command line as root:

wget -P /usr/hdp/current/spark-client/lib/ http://repo.hortonworks.com/content/repositories/releases/com/hortonworks/spark-llap/1.0.0.2.5.3.0-37/spark-llap-1.0.0.2.5.3.0-37-assembly.jar

Copy the assembly to the same location on each host where Spark may start an executor. If queues are not enabled, this likely means all hosts running a NodeManager service. Make sure all users have read permission on that location and on the assembly file.

Configure Spark for LLAP

- In Ambari, navigate to the Spark service configuration tab.
- Find Custom spark-defaults, click Add Property, and add the following properties:
  - spark.sql.hive.hiveserver2.url=jdbc:hive2://{hiveserver-interactive-hostname}:10500
  - spark.jars=/usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.5.3.0-37-assembly.jar
  - spark.hadoop.hive.zookeeper.quorum={some-or-all-zookeeper-hostnames}:2181
  - spark.hadoop.hive.llap.daemon.service.hosts=@llap0
- Find Custom spark-thrift-sparkconf, click Add Property, and add the same four properties.
- Find Advanced spark-env and set the spark_thrift_cmd_opts attribute to --jars /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.5.3.0-37-assembly.jar
- Save all configuration changes.
- Restart all components of Spark and make sure the Spark Thrift Server is started.

Enable Ranger for Hive

- Navigate to the Ranger service Configs tab.
- Click on the Ranger Plugin tab.
- Click the switch labeled "Enable Ranger Hive Plugin".
- Save the configuration and restart all required services.

Create and Stage Sample Data in an External Hive Table

From the command line:

cd /tmp
wget https://www.dropbox.com/s/r70i8j1ujx4h7j8/data.zip
unzip data.zip
sudo -u hdfs hadoop fs -mkdir /tmp/FactSales
sudo -u hdfs hadoop fs -chmod 777 /tmp/FactSales
sudo -u hdfs hadoop fs -put /tmp/data/FactSales.csv /tmp/FactSales
beeline -u jdbc:hive2://{hiveserver-host}:10000 -n hive -e "CREATE TABLE factsales_tmp (SalesKey int, DateKey timestamp, channelKey int, StoreKey int, ProductKey int, PromotionKey int, CurrencyKey int, UnitCost float, UnitPrice float, SalesQuantity int, ReturnQuantity int, ReturnAmount float, DiscountQuantity int, DiscountAmount float, TotalCost float, SalesAmount float, ETLLoadID int, LoadDate timestamp, UpdateDate timestamp) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/tmp/FactSales'"

Move Data into Hive Tables

From the command line:

beeline -u jdbc:hive2://{hiveserver-host}:10000 -n hive -e "CREATE TABLE factsales (SalesKey int, DateKey timestamp, channelKey int, StoreKey int, ProductKey int, PromotionKey int, CurrencyKey int, UnitCost float, UnitPrice float, SalesQuantity int, ReturnQuantity int, ReturnAmount float, DiscountQuantity int, DiscountAmount float, TotalCost float, SalesAmount float, ETLLoadID int, LoadDate timestamp, UpdateDate timestamp) clustered by (saleskey) into 7 buckets stored as ORC"
beeline -u jdbc:hive2://{hiveserver-host}:10000 -n hive -e "INSERT INTO factsales SELECT * FROM factsales_tmp"

View Metadata in Atlas

- Navigate to the Atlas service and click on Quick Links --> Atlas Dashboard (user: admin, password: admin).
- Create a new tag called "secure".
- Click on Search, flip the switch to "DSL", select "hive_table", and submit the search.
- When we created the sample Hive tables earlier, the Hive hook updated Atlas with metadata representing the newly created data sets.
- Click on factsales to see details, including lineage and schema information, for the factsales Hive table.
- Scroll down and click on the Schema tab.
- Click on the plus sign next to the storekey column and add the "secure" tag we created earlier.
- The storekey column of the factsales Hive table is now tagged as "secure". We can now configure Ranger to secure access to the storekey field based on metadata in Atlas.

Configure Ranger Security Policies

- Navigate to the Ranger service and click on Quick Links --> Ranger Admin UI (user: admin, password: admin).
- Click on Access Manager --> Tag Based Policies.
- Click the plus sign to add a new tag service, then name and add the new service.
- The new tag service will show up as a link. Click the link to enter the tag service configuration screen.
- Click Add New Policy.
- Name the policy and enter "secure" in the TAG field. This refers to the tag we created in Atlas. Once the policy is configured, the Ranger tag-sync service will look for notifications from Atlas that the "secure" tag was added to an entity. When it sees such a notification, it will update authorization as described by the tag based policies.
- Scroll down and click on the link to expand the Deny Condition section.
- Set the User field to the user hive and the component Permission section to Hive.
- Click Add to finalize and create the policy. Now Atlas will notify Ranger whenever an entity is tagged as "secure" or the "secure" tag is removed. The "secure" tag policy permissions will apply to any entity tagged with the "secure" tag.
- Click on Access Manager and select Resource Based Policies.
- Next to the {clustername}_hive service link, click the edit icon (it looks like a pen on paper). Make sure to click the icon and not the link.
- Select the tag service we created earlier from the drop-down and click Save. This step is important because it is how Ranger associates the tag notifications coming from Atlas with the Hive security service.
- You should find yourself at the Resource Based Policies screen again. This time, click on the {clustername}_hive service link under the Hive section.
- Several default Hive security policies should be visible. The user hive is allowed access to all tables and all columns.
- The cluster is now secured with resource and tag based policies. Let's test how these work together using Spark.

Test Fine Grain Security with Spark

Connect to the Spark Thrift Server using beeline as the user hive and verify that the sample tables are visible:

beeline -u jdbc:hive2://{spark-thrift-server-host}:10015 -n hive
Connecting to jdbc:hive2://{spark-thrift-server-host}:10015
Connected to: Spark SQL (version 1.6.2)
Driver: Hive JDBC (version 1.2.1000.2.5.3.0-37)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.5.3.0-37 by Apache Hive
0: jdbc:hive2://{spark-thrift-server-host}:10015> show tables;
+----------------+--------------+--+
| tableName | isTemporary |
+----------------+--------------+--+
| factsales | false |
| factsales_tmp | false |
+----------------+--------------+--+
2 rows selected (0.793 seconds)
Get the explain plan for a simple query:

0: jdbc:hive2://{spark-thrift-server-host}:10015> explain select storekey from factsales;
| == Physical Plan == |
| Scan LlapRelation(org.apache.spark.sql.hive.llap.LlapContext@44bfb65b,Map(table -> default.factsales, url -> jdbc:hive2://sparksecure01-195-1-0.field.hortonworks.com:10500))[storekey#66] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
2 rows selected (1.744 seconds)
- The explain plan should show that the table will be scanned using the LlapRelation class. This confirms that Spark is using LLAP to read from HDFS.
- Recall that the user hive should have complete access to all databases, tables, and columns per the Ranger resource based policy.
- Attempt to select storekey from factsales as the user hive. Even though the user hive should have full access to the factsales table, we were able to restrict access to the storekey column by designating it as "secure" using a tag in Atlas.
- Attempt to select saleskey from factsales as the user hive. The saleskey column is not designated as secure via tag, so access is allowed: the user hive has access and the field is not designated as secure.
- Return to the factsales page in Atlas and remove the "secure" tag from the storekey column.
- Wait 30-60 seconds for the notification from Atlas to be picked up, processed, and propagated.
- Attempt to select storekey from factsales as the user hive once again. This time access is allowed since the "secure" tag has been removed from the storekey column of the factsales table in Atlas.
- Back in the Ranger UI, click on Audit to see all of the access attempts that have been recorded by Ranger. Notice that the first access attempt was denied based on the tag [secure].

Ranger already provides extremely fine grain security for both Hive and Spark. However, in combination with Atlas, yet another level of security can be added. Tag based security for Spark provides additional flexibility in controlling access to datasets.
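A minimal sketch of the test session described above (the commands mirror the steps; the error text is hypothetical and varies by version):

# As the user hive, select the tagged column -- denied by the tag based Deny Condition
beeline -u jdbc:hive2://{spark-thrift-server-host}:10015 -n hive -e "SELECT storekey FROM factsales LIMIT 5"
# Expected: an authorization error, e.g. "Permission denied: user [hive] does not have [SELECT] privilege"

# Select an untagged column -- allowed by the resource based policy
beeline -u jdbc:hive2://{spark-thrift-server-host}:10015 -n hive -e "SELECT saleskey FROM factsales LIMIT 5"

# After removing the "secure" tag in Atlas and waiting 30-60 seconds, the first query succeeds as well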
12-18-2016
03:15 PM
19 Kudos
This tutorial is a companion to this article: https://community.hortonworks.com/articles/72182/bring-spark-to-the-business-with-fine-grain-securi.html The article outlines the use cases and potential benefits to the business that Spark fine grain security with LLAP may yield. This tutorial also has a second part, Tag Based (Meta Data) Security for Apache Spark with LLAP, Atlas, and Ranger, which covers how to apply tag based security for Spark using Ranger and Atlas in combination.

Getting Started

Install an HDP 2.5.3 cluster via Ambari.
Make sure the following components are installed:
Hive, Spark, Spark Thrift Server, Ambari Infra, and Ranger
Enable LLAP

Navigate to the Hive configuration page and click Enable Interactive Query. Ambari will ask which host group to put the HiveServer2 Interactive service into; select the host group with the most available resources. With Interactive Query enabled, Ambari will display new configuration options that control resource allocation for the LLAP service. LLAP is a set of long-lived daemons that facilitate interactive query response times and fine grain security for Spark. Since the goal of this tutorial is to test fine grain security for Spark, LLAP only needs a minimal allocation of resources. However, if more resources are available, feel free to increase the allocation and run some Hive queries against the Hive Interactive server to get a feel for how LLAP improves Hive's performance. Save the configuration, confirm, and proceed. Restart all required services, then navigate to the Hive Summary tab and ensure that HiveServer2 Interactive is started.
Download the Spark-LLAP Assembly

From the command line as root:

wget -P /usr/hdp/current/spark-client/lib/ http://repo.hortonworks.com/content/repositories/releases/com/hortonworks/spark-llap/1.0.0.2.5.3.0-37/spark-llap-1.0.0.2.5.3.0-37-assembly.jar

Copy the assembly to the same location on each host where Spark may start an executor. If queues are not enabled, this likely means all hosts running a NodeManager service. Make sure all users have read permission on that location and on the assembly file.
Configure Spark for LLAP

- In Ambari, navigate to the Spark service configuration tab.
- Find Custom spark-defaults, click Add Property, and add the following properties:
  - spark.sql.hive.hiveserver2.url=jdbc:hive2://{hiveserver-interactive-hostname}:10500
  - spark.jars=/usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.5.3.0-37-assembly.jar
  - spark.hadoop.hive.zookeeper.quorum={some-or-all-zookeeper-hostnames}:2181
  - spark.hadoop.hive.llap.daemon.service.hosts=@llap0
- Find Custom spark-thrift-sparkconf, click Add Property, and add the same four properties.
- Find Advanced spark-env and set the spark_thrift_cmd_opts attribute to --jars /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.5.3.0-37-assembly.jar
- Save all configuration changes.
- Restart all components of Spark and make sure the Spark Thrift Server is started.
Enable Ranger for Hive

- Navigate to the Ranger service Configs tab.
- Click on the Ranger Plugin tab.
- Click the switch labeled "Enable Ranger Hive Plugin".
- Save the configuration and restart all required services.
Create and Stage Sample Data in an External Hive Table

From the command line:

cd /tmp
wget https://www.dropbox.com/s/r70i8j1ujx4h7j8/data.zip
unzip data.zip
sudo -u hdfs hadoop fs -mkdir /tmp/FactSales
sudo -u hdfs hadoop fs -chmod 777 /tmp/FactSales
sudo -u hdfs hadoop fs -put /tmp/data/FactSales.csv /tmp/FactSales
beeline -u jdbc:hive2://sparksecure01-195-3-2:10000 -n hive -e "CREATE TABLE factsales_tmp (SalesKey int, DateKey timestamp, channelKey int, StoreKey int, ProductKey int, PromotionKey int, CurrencyKey int, UnitCost float, UnitPrice float, SalesQuantity int, ReturnQuantity int, ReturnAmount float, DiscountQuantity int, DiscountAmount float, TotalCost float, SalesAmount float, ETLLoadID int, LoadDate timestamp, UpdateDate timestamp) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/tmp/FactSales'"
Move Data into Hive Tables

From the command line:

beeline -u jdbc:hive2://sparksecure01-195-3-2:10000 -n hive -e "CREATE TABLE factsales (SalesKey int, DateKey timestamp, channelKey int, StoreKey int, ProductKey int, PromotionKey int, CurrencyKey int, UnitCost float, UnitPrice float, SalesQuantity int, ReturnQuantity int, ReturnAmount float, DiscountQuantity int, DiscountAmount float, TotalCost float, SalesAmount float, ETLLoadID int, LoadDate timestamp, UpdateDate timestamp) clustered by (saleskey) into 7 buckets stored as ORC"
beeline -u jdbc:hive2://sparksecure01-195-3-2:10000 -n hive -e "INSERT INTO factsales SELECT * FROM factsales_tmp"
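As an optional sanity check (any query against the new table will do; the count depends on the sample file), verify that the ORC table was populated:

beeline -u jdbc:hive2://sparksecure01-195-3-2:10000 -n hive -e "SELECT COUNT(*) FROM factsales"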
Configure Ranger Security Policies

- Navigate to the Ranger service and click on Quick Links --> Ranger Admin UI (user: admin, password: admin).
- Click on the {clustername}_hive service link under the Hive section. Several Hive security policies should be visible.
- Add a new column level policy for the user spark as shown in the screenshot below. Make sure that the storekey column is excluded from access.
- The user hive is allowed access to all tables and all columns; the user spark is restricted from accessing the storekey column in the factsales table.
- Click on the Masking tab and add a new masking policy for the user spark to redact the salesamount column.
- Click on the Row Level Filter tab and add a new row level filter policy for the user spark to only show productkey < 100.
Test Fine Grain Security with Spark

Connect to the Spark Thrift Server using beeline as the user hive and verify that the sample tables are visible:

beeline -u jdbc:hive2://sparksecure01-195-1-0:10015 -n hive
Connecting to jdbc:hive2://sparksecure01-195-1-0:10015
Connected to: Spark SQL (version 1.6.2)
Driver: Hive JDBC (version 1.2.1000.2.5.3.0-37)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.5.3.0-37 by Apache Hive
0: jdbc:hive2://sparksecure01-195-1-0:10015> show tables;
+----------------+--------------+--+
| tableName | isTemporary |
+----------------+--------------+--+
| factsales | false |
| factsales_tmp | false |
+----------------+--------------+--+
2 rows selected (0.793 seconds)
Get the explain plan for a simple query:

0: jdbc:hive2://sparksecure01-195-1-0:10015> explain select storekey from factsales;
| == Physical Plan == |
| Scan LlapRelation(org.apache.spark.sql.hive.llap.LlapContext@44bfb65b,Map(table -> default.factsales, url -> jdbc:hive2://sparksecure01-195-1-0.field.hortonworks.com:10500))[storekey#66] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
2 rows selected (1.744 seconds)
- The explain plan should show that the table will be scanned using the LlapRelation class. This confirms that Spark is using LLAP to read from HDFS.
- Verify that the user hive is able to see the storekey, unredacted salesamount, and unfiltered productkey columns in the factsales table, as specified by the policy. Hit Ctrl-C to exit beeline.
- Connect to the Spark Thrift Server using beeline as the user spark and run the exact same query the user hive just ran. An exception will be thrown by the authorization plugin because the user spark is not allowed to see the results of any query that includes the storekey column.
- Try the same query but omit the storekey column from the request. The response will show a filtered productkey column and a redacted salesamount column.
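A minimal sketch of that session as the user spark (hypothetical output; exact messages vary by version):

# The same query the user hive ran -- fails because it touches the storekey column
beeline -u jdbc:hive2://sparksecure01-195-1-0:10015 -n spark -e "SELECT storekey, salesamount, productkey FROM factsales LIMIT 10"
# Expected: an authorization exception referencing the storekey column

# Omit storekey -- succeeds, but salesamount comes back redacted and only rows with productkey < 100 are returned
beeline -u jdbc:hive2://sparksecure01-195-1-0:10015 -n spark -e "SELECT salesamount, productkey FROM factsales LIMIT 10"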
View Audit Trail

- Navigate back to the Ranger Admin UI.
- Navigate to Audit (the link at the top of the screen). Ranger Audit registers both Allowed and Denied access events.

Now access to data through the Spark Thrift Server is secured by the same granular security policies as Hive. Ranger provides the centralized policies, and LLAP ensures they are enforced. BI tools can now be pointed at Spark or Hive interchangeably.
12-16-2016
06:03 PM
4 Kudos
Apache Spark has ignited an explosion of data exploration on very large data sets. Spark played a big role in making general purpose distributed compute accessible: anyone with some level of skill in Python, Scala, Java, and now R, can sit down and start exploring data at scale. It also democratized data science by offering machine learning as a series of black boxes, making the training of artificial intelligence possible for those of us who do not have PhDs in statistics and mathematics. Now Spark SQL is also helping to bring data exploration to the business unit directly. In partnership with Apache Hive, Spark has been enabling users to explore very large data sets using SQL expressions.
However, in order to truly make Spark SQL available for ad-hoc access by business analysts using BI tools, fine grain security and governance are necessary. Spark provides strong authentication via Kerberos and wire encryption via SSL. However, up to this point, authorization was only possible via HDFS ACLs. This approach works relatively well when Spark is used as a general purpose compute framework, that is, when Java/Scala/Python is used to express logic that cannot be encapsulated in a SQL statement. However, when a structured schema with columns and rows is applied, fine grain security becomes a challenge. Data in the same table may belong to two different groups, each with their own regulatory requirements. Data may have regional restrictions, time based availability restrictions, departmental restrictions, etc. Currently, Spark does not have a built-in authorization subsystem. It tries to read the data set as instructed and either succeeds or fails based on file system permissions. There is no way to define a pluggable module that contains an instruction set for fine grain authorization. This means that authorization policy enforcement must be performed somewhere outside of Spark. In other words, some other system has to tell Spark that it is not allowed to read the data because it contains a restricted column.
At this point there are two likely solutions. The first is to create an authorization subsystem within Spark itself. The second is to configure Spark to read the file system through a daemon that is external to Spark. The second option is particularly attractive because it can provide benefits far beyond just security. Thus, the community created LLAP (Live Long and Process). LLAP is a collection of long lived daemons that work in tandem with the HDFS DataNode service. LLAP is optional and modular, so it can be turned on or off. At the moment, Apache Hive has the most built-in integration with LLAP. However, the intent of LLAP is to provide benefits to applications running in Yarn generally. When enabled, LLAP provides numerous performance benefits:
- Processing Offload
- IO Optimization
- Caching
Since the focus of this article is security for Spark, refer to the LLAP Apache wiki for more details on LLAP:
https://cwiki.apache.org/confluence/display/Hive/LLAP

With LLAP enabled, Spark reads from HDFS go directly through LLAP. Besides conferring all of the aforementioned benefits on Spark, LLAP is also a natural place to enforce fine grain security policies. The only other capability required is a centralized authorization system. This need is met by Apache Ranger.
Apache Ranger provides centralized authorization and audit services for many components that run on Yarn or rely on data from HDFS. Ranger allows authoring of security policies for:
- HDFS
- Yarn
- Hive (Spark with LLAP)
- HBase
- Kafka
- Storm
- Solr
- Atlas
- Knox
Each of the above services integrates with Ranger via a plugin that pulls the latest security policies, caches them, and then applies them at run time. Now that we have defined how fine grain authorization and audit can be applied to Spark, let's review the overall architecture.
Spark receives the query statement and communicates with Hive to obtain the relevant schemas and query plan. The Ranger Hive plugin checks the cached security policy and tells Spark which columns the user is allowed to access. Spark does not have its own authorization system, so it begins attempting to read from the filesystem through LLAP. LLAP reviews the read request and realizes that it contains columns that the user making the request does not have permission to read. LLAP instantly stops processing the request and throws an authorization exception to Spark. Notice that there was no need to create any type of view abstraction over the data. The only action required for fine grain security enforcement is to configure a security policy in Ranger and enable LLAP.
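As a concrete sketch of that flow (the customers table, ssn column, user analyst, and error text are all hypothetical illustrations):

# A Ranger policy denies the user analyst access to the ssn column
beeline -u jdbc:hive2://{spark-thrift-server-host}:10015 -n analyst -e "SELECT name, ssn FROM customers"
# LLAP rejects the read and Spark surfaces an authorization exception, e.g.
#   Error: ... user [analyst] does not have [SELECT] privilege on [default/customers/ssn] ...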
Ranger also provides column masking and row filtering capabilities. A masking policy is similar to a column level policy. The main difference is that all columns are returned, but the restricted columns contain only asterisks or a hash of the original value. Ranger also provides the ability to apply row level security. Using a row level security policy, users can be prevented from seeing some of the rows in a table while still seeing all rows not restricted by policy. Consider a scenario where financial managers should only be able to see clients assigned to them. A row level policy from Ranger would instruct Hive to return a query plan that includes a predicate. That predicate filters out all customers not assigned to the financial manager trying to access the data. Spark receives the modified query plan and initiates processing, reading data through LLAP. LLAP ensures that the predicate is applied and that the restricted rows are not returned. With such an array of fine grain security capabilities, Spark can now be exposed directly to BI tools via a Thrift Server. Business analysts can now wield the power of Apache Spark. In general, LLAP integration has the potential to greatly enhance Spark from both a performance and a security perspective. Fine grain security will help to bring the benefits of Spark to the business. Such a development should help to fuel more investment, collection, and exploration of data. If you would like to test out this capability for yourself, check out the following tutorial: https://community.hortonworks.com/content/kbentry/72454/apache-spark-fine-grain-security-with-llap-test-dr.html
10-02-2016
04:37 PM
1 Kudo
Money management and retirement planning are extremely important issues. The conventional wisdom says to get a financial advisor who will invest your money and allow you to reach your financial goals. However, as of late, financial institutions have begun to offer services generally referred to as Digital Advisors or "Robo Advisors". These services aim to offer very low cost, broad spectrum, long term investment advice across a limited set of financial products. The advice is personalized based on basic input from the customer, such as level of risk tolerance, initial investment, years to invest, etc. The basic premise behind the offering is diversification. Statistics have shown that few money managers are able to beat the S&P 500 over the long term, especially after fees. The Digital Advisor aims to spread the initial investment across a broad spectrum of securities at a much lower cost. The Digital Advisor provides another major advantage over the human financial advisor. Using the power of distributed processing and Monte Carlo simulation methodology, the Digital Advisor is able to project the returns of the recommended portfolio. This capability allows the Digital Advisor to show a confidence range of likelihoods for future portfolio value. Depending on the level of sophistication, the Digital Advisor could use similar methodology to periodically re-run the projections to determine if the portfolio is on track. If not, the portfolio can be adjusted to address unforeseen market conditions. The financial industry has been using the Monte Carlo method to project investment returns and evaluate risk for some time. The basic principle behind the Monte Carlo method is to establish a model (a mathematical equation) that represents the target scenario. The model should include a representation of the range of possible outcomes. This is typically done using historical performance and current market data. The next step is to develop a pseudo-random number generator to create data that can be plugged into the model. The generator is pseudo-random because it must only generate numbers that fit into the range of outcomes represented by the model. The last step is to generate and plug pseudo-random numbers into the model over and over again. The idea is that the outcomes that are most likely, based on the model, will occur most frequently. Thus, the Monte Carlo simulation can project the likely outcome for an investment strategy. Let's take a look at an example of a simplified Digital Advisor implementation:

https://community.hortonworks.com/repos/54060/spark-and-zeppelin-digital-advisor-demo.html

https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL3Zha3Nob3J0b24vRGlnaXRhbEFkdmlzb3IvbWFzdGVyL3plcHBlbGluL25vdGVib29rL0RpZ2l0YWxGaW5hbmNpYWxBZHZpc29yRGVtby5qc29u
This demonstration is implemented as an Apache Zeppelin notebook using Apache Spark on the Hortonworks Data Platform. Apache Spark is the ideal engine for Monte Carlo simulation. Spark has a very user friendly API, it can spread complex computations across many hosts' processors, and it can reduce completion time by caching intermediate result sets in memory. Apache Zeppelin is a great complement to Spark. Zeppelin provides a clean and simple interface that can leverage SparkSQL to visualize complex result sets.
The first two sections create a form for user input and gather historical data from Yahoo. This data is required in order to create the model that the simulation is based on. Next, the Digital Advisor generates a portfolio based on the user's indicated risk tolerance. Real world portfolio selection can be extremely involved; for the purposes of this demonstration, the implementation only ensures that the user's risk tolerance is honored. Finally, the actual model that underpins the simulation is created. The time scale is adjusted to take into account annual as opposed to daily returns. Each instrument in the portfolio is assigned a likely range of return based on historical standard deviation. When the simulation starts, the algorithm calculates the annual rate of return for each portfolio instrument by plugging in a pseudo-random value within the standard deviation. Using the rate of return, the actual value of each instrument is derived. This step is repeated for the number of years requested by the user, keeping track of total value year over year. The result of this entire process represents a single possible final outcome for the portfolio. In order to have any confidence in the prediction, it is necessary to repeat the entire process many times. Each repeated simulation provides a new possible outcome. When the requested number of simulations is completed, Zeppelin leverages SparkSQL to create visualizations of each simulated portfolio path and of the possible range and confidence of final portfolio value.
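The core loop is easy to sketch outside of Spark. Below is a minimal, hypothetical single-instrument version in awk; the mean return, standard deviation, starting value, and simulation counts are made-up parameters, and the real demo runs this kind of loop per instrument, in parallel, on Spark:

awk 'BEGIN {
  srand();
  sims = 10000; years = 30; mean = 0.07; sd = 0.15;
  for (s = 1; s <= sims; s++) {
    value = 10000;                               # initial investment
    for (y = 1; y <= years; y++) {
      # Box-Muller transform: a pseudo-random, normally distributed annual return
      r = mean + sd * sqrt(-2 * log(1 - rand())) * cos(6.2831853 * rand());
      value *= (1 + r);                          # compound one year of growth
    }
    total += value;
    if (value > max) max = value;
    if (min == 0 || value < min) min = value;
  }
  printf "mean final value: %.2f (min %.2f, max %.2f)\n", total / sims, min, max;
}'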
08-15-2016
03:47 AM
6 Kudos
This article is a companion to the article "Avro Schema Registry with Apache Atlas for Streaming Data Management". https://community.hortonworks.com/articles/51379/avro-schema-registry-with-apache-atlas-for-streami.html The article explores how an Avro schema registry can bring data governance to streaming data and the benefits that come with it. This tutorial demonstrates the implementation of this concept and some of the resulting features.
- Download the HDP 2.5 Sandbox and modify the hosts file on the local machine to resolve sandbox.hortonworks.com to 127.0.0.1
- SSH to the Sandbox (ssh root@sandbox.hortonworks.com -p 2222)
- Run ambari-admin-password-reset and make sure to set the Ambari password to "admin"
- Log into Ambari (http://sandbox.hortonworks.com:8080/) and start the following services:
HBase, Log Search, Kafka, Atlas

From the SSH console:
git clone https://github.com/vakshorton/AvroSchemaShredder
cd /root/AvroSchemaShredder
chmod 755 install.sh
./install.sh
java -jar AvroSchemaShredder-jar-with-dependencies.jar

Open a second SSH session to the Sandbox (ssh root@sandbox.hortonworks.com -p 2222):

cd /root/AvroSchemaShredder
curl -u admin:admin -d @schema/schema_1.json -H "Content-Type: application/json" -X POST http://sandbox.hortonworks.com:8090/schemaShredder/storeSchema
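For reference, a minimal Avro schema of the kind that can be registered this way might look like the following (illustrative only; the bundled schema/schema_1.json may differ):

cat > /tmp/sample_schema.json <<'EOF'
{
  "type": "record",
  "name": "Customer",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": "string"},
    {"name": "signupDate", "type": "long"}
  ]
}
EOF
curl -u admin:admin -d @/tmp/sample_schema.json -H "Content-Type: application/json" -X POST http://sandbox.hortonworks.com:8090/schemaShredder/storeSchema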
The curl command makes a REST API call to the AvroSchemaShredder service, which parses the Avro schema and stores it in Atlas. Log into Atlas at http://sandbox.hortonworks.com:21000 (user: admin, password: admin) and search for "avro_schema". The search should return a list of schemas that were created when the request to register schemas was made via the REST service call.
Click into one of the schemas and notice the available information about the top level record. The record will have a "fields" attribute that contains links to other sub-elements and, in some cases, other schemas. Now any field of any registered schema can be searched and tagged. Schemas can be associated with Kafka topics, allowing discovery of streaming data sources on those topics. Also notice that the curl REST call returned a GUID. That GUID can be used to access the schema that was registered. This means that a message can be automatically deserialized from a Kafka topic based on the "fingerprint" associated with the message on the Kafka topic. This could be achieved using a standard client that depends on the Avro schema registry to deserialize messages. To retrieve the Avro compliant schema notation, get the GUID that the curl command returned after the sample schema was registered:

curl -u admin:admin -X GET http://sandbox.hortonworks.com:8090/schemaShredder/getSchema/{GUID}
The response should be an Avro compliant schema descriptor. This prototype does not handle schema validation or compatibility enforcement. It also does not do any caching to optimize performance or leverage Kafka for asynchronous notification. However, it does demonstrate how the described capabilities can be achieved. Repo: https://community.hortonworks.com/content/repo/51366/avro-schema-shredder.html
08-13-2016
08:59 PM
6 Kudos
Data is becoming the new precious resource. As the world produces more and more data,
business units find increasingly more ways to monetize that data. This means that data that
used to be retained for a short time or not at all, is now being persisted long term. This
data is being gathered from more and more sources and not necessarily from within the
organization that uses it. It is also increasingly being generated by machines, meaning
that the volume, velocity, and variety of the data proliferate at an overwhelming rate.
There are now lots of tools that enable an organization to address the challenges imposed by
the proliferation of data. However, many organizations have been focused on dealing with
volume and velocity while not focusing on the challenges created by the lack or
inconsistency of structure. In order to truly unlock the power of all that data, an organization must first apply a
consistent set of guidelines for governance of the data. Getting value from new data
sources often requires imposing schemas on unstructured or semi-structured data.
This is because the new data often has to be combined with existing structured data in order for it
to be useful. Schemas can also be important for security as sensitive bits of data are
often mixed in data sets that are generally considered non-sensitive. Finally, business
units generally do not create the technologies that monetize the data. That job falls to
many different engineering groups that are often decentralized. In order to effectively
create the tools that enable harvesting value from data, engineering teams need to agree on
how that data should be used, modified, and enriched. Consider a scenario where two
different engineering teams are working on requirements from two different business units
and have no knowledge of the other's work. When team A wants to evolve the schema of some
data set, they must be sure that the change will not disrupt the work of team B. This is
challenging since team A may not know that team B is using the same data or what they are
doing with it. In addition, team B will likely derive a new data set from the existing
data. That new data set may be exactly what team A needs to deliver what the business has
asked for. Team A needs to be able to discover the fact that team B has produced a new data
set from the one that both teams were using. It used to be that data was primarily stored in silo-ed relational databases in a
structured format. The very existence of data was predicated on the existence of a well defined schema.
In the new world of Big Data platforms, data is often stored without a schema, and in some cases
the data is a stream of messages in a queueing system. Data Governance tools like
Apache Atlas can help with management of data sets and processes that evolve them. The flexibility of
Atlas enables creation of new managed Types that can be used to govern
data sets from just about any data source. In fact, as of Hortonworks Data Platform 2.5, Atlas is used to visualize
and track cross component lineage of data ingested via Apache Hive, Apache Sqoop, Apache Falcon,
Apache Storm, Apache Kafka, and in the future, Apache Nifi. Schemas for Hive tables are stored and
governed, thus covering many data at rest use cases. It makes a lot of sense to manage schemas for streaming
data sources within Atlas as well. Kafka topics are captured as part of Storm
topologies but currently, only configuration information is available. The concept of an Avro Schema Registry
combined with existing governance capabilities of Atlas, would extend the benefits of data governance
to streaming data sets.
In order to extend the concept of schema to streaming data sets, a serialization format with a built-in concept of schema is required.
Apache Avro is a commonly used serialization format for streaming data.
It is extremely efficient for writes and includes self describing schema as part of its specification.
Avro schema specification allows for schema evolution that is backward or forward compatible.
Each message can be serialized with its schema so that an independent downstream
consumer is able to deserialize the message. Instead of the full schema, it is also possible
to pass a "fingerprint" that uniquely identifies the schema. This is useful when the
schema is very large. However, using a fingerprint with messages that will travel through
multiple Kafka topics requires that the consumer is able to reference the schema that the
fingerprint refers to. Atlas can be used to not only store Avro schemas but to make them
searchable, and useful for data governance, discovery, and security.
The first step to using Atlas as an Avro Schema Registry is to add new Types that align to
the Avro Schema specification. Avro Schema supports the following types: Records, Enums, Arrays, Maps, Unions, Fixed, and Primitives. Using the Atlas API, it is possible to create types that exhibit the same kinds of attributes
and nesting structure. The second required component is a service that is capable of parsing
an Avro Schema JSON representation and translating it into the new Atlas Avro types. After registering
the schema, the service should return a fingerprint (GUID) that will act as the claim check for that schema on deserialization.
The service should also handle schema validation and compatibility enforcement. This set
of capabilities would allow automatic deserialization of messages from a Kafka topic. While just having an Avro Schema Registry is valuable for streaming use cases, using Atlas
as the underlying store provides substantial value. Data discovery becomes much easier
since all of the fields in each Avro Schema can be individually indexed. This means that a user
can search for the name of a field and determine the schema and Kafka topic where it can be found.
In many use cases the messages flowing through the Kafka topic flow into a Hive table,
HDFS location, or some NoSQL store. Engineering teams can use the cross component lineage
visualization in Atlas to understand the effects that schema evolution will have downstream.
Atlas also provides the ability to apply tags and business taxonomies. These capabilities
make it really easy to curate, understand, and control how streaming data is deployed and secured.
For example, Apache Atlas integrates with Apache Ranger (Authorization system) to enable tag based
policies. This capability allows column level authorization for data managed by Apache Hive
based on tags applied to the meta data in Atlas. Apache Ranger is also currently able to secure
Kafka topics based on source IP or user name (in Kerberized clusters). Tag based policies
are not yet available for Kafka topics. However, it should be possible to reuse the same
tag sync subsystem used to implement tag based policies in Hive. Tags can also be used
to deprecate older schemas or to prevent evolution of certain schemas through the Registry API.
Finally, because Atlas uses HBase and Solr under the covers, enterprise requirements like HA
and DR capabilities do not need to be re-implemented. It is clear that data governance is becoming an absolutely essential component of an enterprise
data management platform. Whether the data is streaming or at rest, both business and
technology organizations need to discover, understand, govern, and secure that data. Combining
capabilities of existing data governance tools like Apache Atlas with schema aware data formats
like Apache Avro (Kafka) and Apache ORC (Hive/Pig/Spark) can make managing Big Data that
much easier.
07-11-2016
03:03 PM
5 Kudos
For a long time, when there was a big job to do, people relied on horses, whether the job required heavy pulling, speed, or anything in between.
However, not all horses were fit for every task. Certain breeds were valued for their incredible speed and endurance, especially when an important
letter had to be delivered. Others were prized for their ability to carry and pull large payloads whether it was a fully armored knight or huge stone blocks.
Today we rarely rely on horses and much more on technology to get many of the same kinds of jobs done.
Very high volume streaming data is increasingly common in all lines of business because of the value and utility that it often carries.
Horses will be of little help with this type of workload but luckily there is a whole host of tools to deal with streaming data. However, just like with
horses, choosing the right streaming tool for a particular use case is critical to the success of the project. Consider a use case where a directory full of log files needs to be broken down
into individual events, filtered/altered, turned into JSON, and then sent on to a Kafka topic. This use case has exactly the kind of requirements that Apache Nifi was designed for. All you would have to do is string together ListHDFS, FetchHDFS, SplitText, ExtractText, AttributesToJSON, and finally a PutKafka processor. This Nifi flow would distribute each file in the target directory across the Nifi cluster, extract/alter the events, output each event as JSON, and send them to a Kafka topic. Notice that not a single line of code was required to solve the use case. The same use case can be solved using Spark Streaming, but it would require a lot of code and an intricate understanding of how and where Spark stores and processes DStreams. This article, http://allegro.tech/2015/08/spark-kafka-integration.html, does a great job of outlining how to achieve the same result, but it took several iterations by a team of engineers familiar with Spark Streaming. The article explains the importance of understanding which instructions will execute on the driver and which will execute on the executors. It also describes an elegant approach that uses a factory pattern to distribute uninstantiated Kafka producer templates and to make sure that the templates are only instantiated by the executors, thus avoiding the dreaded "Not Serializable" exception. That is a lot of work to solve such a basic use case. Spark is one of the leading tools for complex computation and aggregation on very large volumes of data. It is extremely well suited for machine learning, time series/stateful calculation, aggregation, graph processing, and iterative computation. However, due to its highly distributed nature, simple event processing that only requires event routing, data transformation, data enrichment, and data orchestration is harder to achieve. Conversely, Apache Nifi is not the right tool to solve most complex computation and processing use cases. This is because it was designed to reduce the amount of effort required to create, manipulate, and control streams of events as dynamically as possible. It is a graphical, UI based, distributed framework that is easy to extend and can perform most simple event processing tasks out of the box with very little effort or prior experience required. However, there are many enterprise class use cases that are best solved by using both Spark and Nifi together. Consider one more example where a large organization dealing with millions of IoT enabled devices across the country needs to apply predictive algorithms on aggregated data in near real time. They will need all of the event data to eventually arrive at two or three (in some cases one) processing centers in order to achieve their goals. At the same time, they need to make sure that events lost due to outages are as limited as possible. Such an organization will have many smaller data centers with small infrastructure footprints throughout the country. It is not practical to put a Spark Streaming cluster in those data centers, nor is it safe to just point all of the devices at one or two data centers with heavy processing footprints. One possible approach is to run Nifi at the smaller data centers, spread across a handful of servers, to capture and curate the local event streams. Those streams can then be forwarded as cleaned and secured data sources to the two or three large data centers with large Spark Streaming clusters that apply all of the heavy and complex processing.
This approach would address the need for failure group reduction/isolation, minimize the effort required to manage the distributed data collection infrastructure, allow dynamic updates to event manipulation and routing logic, and provide all of the heavy processing capabilities required to apply the intelligence and reporting required to address the business requirements. To conclude our metaphor, Apache Spark Streaming is the heavy war horse and Apache Nifi is the quick and nimble race horse. Some workloads are best suited for one, some for the other, and some will require both working together.
... View more
Labels:
05-08-2016
02:01 AM
@Benjamin Leonhardi With the release of Yarn.Next, containers will receive their own IP address and get registered in DNS. The FQDN will be available via a REST call to Yarn. If the current Yarn container dies, the docker container will start in a different Yarn container somewhere in the cluster. As long as all clients are pointing at the FQDN of the application, the outage will be nearly transparent. In the meantime, there are several options using only Slider, but they require some scripting or registration in Zookeeper. If you run: slider lookup --id application_1462448051179_0002
2016-05-08 01:55:51,676 [main] INFO impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2016-05-08 01:55:53,847 [main] WARN shortcircuit.DomainSocketFactory - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2016-05-08 01:55:53,868 [main] INFO client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
{
"applicationId" : "application_1462448051179_0002",
"applicationAttemptId" : "appattempt_1462448051179_0002_000001",
"name" : "biologicsmanufacturingui",
"applicationType" : "org-apache-slider",
"user" : "root",
"queue" : "default",
"host" : "sandbox.hortonworks.com",
"rpcPort" : 1024,
"state" : "RUNNING",
"diagnostics" : "",
"url" : "http://sandbox.hortonworks.com:8088/proxy/application_1462448051179_0002/",
"startTime" : 1462454411514,
"finishTime" : 0,
"finalStatus" : "UNDEFINED",
"origTrackingUrl" : "http://sandbox.hortonworks.com:1025",
"progress" : 1.0
}
2016-05-08 01:55:54,542 [main] INFO util.ExitUtil - Exiting with status 0
You do get the host the container is currently bound to. Since the instructions bind the docker container to the host IP, this allows URL discovery, but as I said, not out of the box. This article is merely the harbinger of Yarn.Next, as that will integrate the PaaS capabilities into Yarn itself, including application registration and discovery.
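In the meantime, scripted discovery against the standard ResourceManager REST API looks something like this (a sketch; assumes jq is installed and the application id is known):

APP_ID=application_1462448051179_0002
curl -s http://sandbox.hortonworks.com:8088/ws/v1/cluster/apps/${APP_ID} | jq -r '.app.trackingUrl'
# .app.amHostHttpAddress similarly yields the host running the application master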
... View more
04-29-2016
01:03 AM
4 Kudos
Ambari is a great tool to manage the Hortonworks Data Platform. However, as the complexity of the applications that run on HDP grows, and more and more components from the stack are required, it may be necessary to begin automating tasks like configuration changes. Using the Ambari REST interface, it is possible to install, control, interrogate, and even change the configuration of HDP services. This example demonstrates how to automate the install and configuration of Nifi; however, the same approach can be applied to any service. This example uses curl to make all REST API requests. Make sure that the bits required to actually install the service in Ambari are available on the host running the Ambari server. In the case of Nifi, the bits should be present in /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/NIFI.

1. Check Service Status:

curl -u admin:admin -H "X-Requested-By:ambari" -i -X GET http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/services/NIFI
Use regex or some JSON parsing library/tool to parse the response. In this particular case, you want to get a status 404, as that means the NIFI service does not yet exist.

2. Create Service:

curl -u admin:admin -H "X-Requested-By:ambari" -i -X POST http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/services/NIFI

A check of the Ambari UI should confirm that the NIFI service is visible in the left hand pane.

3. Add Components to the Service:

curl -u admin:admin -H "X-Requested-By:ambari" -i -X POST http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/services/NIFI/components/NIFI_MASTER
The service is merely the container that holds all of the processes that comprise it. When a request to start the service is issued, Ambari actually attempts to start each of the components defined for that service, which are the actual processes that provide functionality.

4. Configure the Components:

This is the tricky part. One of the great values of Ambari is that it allows administrators to add and change configuration for HDP services from a single place. This means that when a service is being installed, the configurations, and the files they are contained in, must be defined as well. Each component of each service has its own configuration files, and each file has a unique set of properties and formats. This means that each configuration file that the Ambari wrapped service expects must be defined and applied to each component of the target service prior to install. The simplest way to obtain the required configuration files is to install the service via the Ambari UI and then use the Ambari config utility located at /var/lib/ambari-server/resources/scripts/configs.sh. This utility uses the REST API to pull the configurations that Ambari requires from an existing service. In the case of an already installed NIFI service, it is possible to get the required configuration files as follows. In the Ambari UI, click on the NIFI service and navigate to the configuration section. Take note of each of the configuration sections listed, then use the Ambari configs utility to export each of the configuration sections to file (not shown below: the script takes a username and password parameter, which should be the Ambari admin user):

/var/lib/ambari-server/resources/scripts/configs.sh get sandbox.hortonworks.com Sandbox nifi-ambari-config >> nifi-ambari-config.json
/var/lib/ambari-server/resources/scripts/configs.sh get sandbox.hortonworks.com Sandbox nifi-bootstrap-env >> nifi-bootstrap-env.json
/var/lib/ambari-server/resources/scripts/configs.sh get sandbox.hortonworks.com Sandbox nifi-flow-env >> nifi-flow-env.json
/var/lib/ambari-server/resources/scripts/configs.sh get sandbox.hortonworks.com Sandbox nifi-logback-env >> nifi-logback-env.json
/var/lib/ambari-server/resources/scripts/configs.sh get sandbox.hortonworks.com Sandbox nifi-properties-env >> nifi-properties-env.json

Make sure to strip out the header that gets added to these files (########## Performing 'GET' on (Site:nifi-ambari-config, Tag:version1461337652473983733)). These configuration definitions can now be used to complete the automation of the NIFI installation as follows (note that the set action takes the file as an argument rather than a redirect):

/var/lib/ambari-server/resources/scripts/configs.sh set sandbox.hortonworks.com Sandbox nifi-ambari-config nifi-ambari-config.json
/var/lib/ambari-server/resources/scripts/configs.sh set sandbox.hortonworks.com Sandbox nifi-bootstrap-env nifi-bootstrap-env.json
/var/lib/ambari-server/resources/scripts/configs.sh set sandbox.hortonworks.com Sandbox nifi-flow-env nifi-flow-env.json
/var/lib/ambari-server/resources/scripts/configs.sh set sandbox.hortonworks.com Sandbox nifi-logback-env nifi-logback-env.json
/var/lib/ambari-server/resources/scripts/configs.sh set sandbox.hortonworks.com Sandbox nifi-properties-env nifi-properties-env.json

5. Add Roles to Member Hosts:

Since HDP is a distributed platform, most services will be installed across multiple hosts. Each host may host different components or all of the same components. Which components of the service run on which hosts must be defined prior to install. In this case, we are assigning the NIFI_MASTER role to the same host where the Ambari server is running.

curl -u admin:admin -H "X-Requested-By:ambari" -i -X POST http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/hosts/sandbox.hortonworks.com/host_components/NIFI_MASTER

6. Install the Service:

The NIFI service is now ready to be installed.

curl -u admin:admin -H "X-Requested-By:ambari" -i -X PUT -d '{"RequestInfo": {"context" :"Install Nifi"}, "Body": {"ServiceInfo": {"maintenance_state" : "OFF", "state": "INSTALLED"}}}' http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/services/NIFI

This request will return a task id as one of the return parameters. The task may take a while and is asynchronous, so it is necessary to get a handle on the task and periodically check whether it has completed, looping/sleeping until it does (a sketch of such a polling loop appears at the end of this article). Check the task status as follows:

curl -u admin:admin -X GET http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/requests/$TASKID

Once the task comes back as COMPLETED, the NIFI service has been installed and is ready to be started.

7. Start the Service:

curl -u admin:admin -H "X-Requested-By:ambari" -i -X PUT -d '{"RequestInfo": {"context" :"Start NIFI"}, "Body": {"ServiceInfo": {"maintenance_state" : "OFF", "state": "STARTED"}}}' http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/services/NIFI

The NIFI service is now up and running and ready to build data flows. This same approach can be used to install, stop, start, and change configuration for any service in Ambari.
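A minimal sketch of the polling loop for step 6 (hypothetical; assumes jq is installed and TASKID holds the request id returned by the install call):

TASKID=42   # replace with the id returned by the install request
while true; do
  STATUS=$(curl -s -u admin:admin http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/requests/${TASKID} | jq -r '.Requests.request_status')
  echo "request ${TASKID}: ${STATUS}"
  if [ "${STATUS}" = "COMPLETED" ] || [ "${STATUS}" = "FAILED" ]; then break; fi
  sleep 10
done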
04-28-2016
07:49 PM
3 Kudos
Modern data driven applications require a "Connected Platform" capable of bringing data in and out to/from the Internet of Things, mobile users, and social media in real time. In order to monetize all of that real time data, the platform must have the ability to process petabytes of data to create adaptive learning algorithms and apply those algorithms in real time as the data streams in and out of the platform. However, the modern data application cannot be effectively utilized or operated without an application tier that allows the business to visualize, interact with, and act on the massive volumes of data and insight coming in and out in real time and accumulating in storage. The Hortonworks Connected Platform (HDP+HDF) has the capability to act as a PaaS that can host the application tier of the modern data application alongside all of the data processing. It is possible to use Slider to run a dockerized application managed by Yarn inside of the Hadoop cluster, similar to an application PaaS. This can be accomplished as follows:

1. Create a web application project that includes the application server embedded in the package. The resulting package should be runnable, like a Java runnable jar. This can be accomplished using Maven. Here is an example of the application packaging:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>BigData</groupId>
<artifactId>ShopFloorUI</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<properties>
<docker.registry.name></docker.registry.name>
<docker.repository.name>${docker.registry.name}vvaks/biologicsmanufacturingui</docker.repository.name>
<tomcat.version>7.0.57</tomcat.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.tomcat.embed</groupId>
<artifactId>tomcat-embed-core</artifactId>
<version>${tomcat.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tomcat.embed</groupId>
<artifactId>tomcat-embed-logging-juli</artifactId>
<version>${tomcat.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tomcat.embed</groupId>
<artifactId>tomcat-embed-jasper</artifactId>
<version>${tomcat.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tomcat</groupId>
<artifactId>tomcat-jasper</artifactId>
<version>${tomcat.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tomcat</groupId>
<artifactId>tomcat-jasper-el</artifactId>
<version>${tomcat.version}</version>
</dependency>
<dependency>
<groupId>org.apache.tomcat</groupId>
<artifactId>tomcat-jsp-api</artifactId>
<version>${tomcat.version}</version>
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-client</artifactId>
<version>9.3.6.v20151106</version>
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-util</artifactId>
<version>9.3.6.v20151106</version>
</dependency>
<dependency>
<groupId>org.cometd.java</groupId>
<artifactId>cometd-api</artifactId>
<version>1.1.5</version>
</dependency>
<dependency>
<groupId>org.cometd.java</groupId>
<artifactId>cometd-java-client</artifactId>
<version>3.0.7</version>
</dependency>
<dependency>
<groupId>javax.el</groupId>
<artifactId>el-api</artifactId>
<version>2.2</version>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>javax.servlet-api</artifactId>
<version>3.1.0</version>
</dependency>
<dependency>
<groupId>javax.servlet.jsp</groupId>
<artifactId>jsp-api</artifactId>
<version>2.2</version>
</dependency>
<dependency>
<groupId>javax.servlet.jsp.jstl</groupId>
<artifactId>javax.servlet.jsp.jstl-api</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-core-asl</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.7.13</version>
</dependency>
</dependencies>
<build>
<finalName>${project.artifactId}</finalName>
<sourceDirectory>src/</sourceDirectory>
<resources>
<resource>
<directory>src/main/webapp</directory>
<targetPath>META-INF/resources</targetPath>
</resource>
<resource>
<directory>src/main/resources</directory>
<targetPath>META-INF/resources</targetPath>
</resource>
</resources>
<outputDirectory>classes/</outputDirectory>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>com.hortonworks.iot.shopfloorui.ShopFloorUIMain</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-war-plugin</artifactId>
<version>2.1.1</version>
<configuration>
<webappDirectory>webapp/</webappDirectory>
<finalName>ShopFloorUI</finalName>
</configuration>
</plugin>
<plugin>
<groupId>org.jolokia</groupId>
<artifactId>docker-maven-plugin</artifactId>
<version>0.13.3</version>
<configuration>
<images>
<image>
<!-- <alias>${project.artifactId}</alias>
<name>${docker.repository.name}:${project.version}</name> -->
<alias>biologicsmanufacturingui</alias>
<name>${docker.repository.name}</name>
<build>
<from>java:8-jre</from>
<maintainer>vvaks</maintainer>
<assembly>
<descriptor>docker-assembly.xml</descriptor>
</assembly>
<ports>
<port>8090</port>
</ports>
<cmd>
<shell>java -jar \
/maven/ShopFloorUI-jar-with-dependencies.jar server \
/maven/docker-config.yml</shell>
</cmd>
</build>
</image>
</images>
</configuration>
</plugin>
</plugins>
</build>
</project>
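The docker-maven-plugin section above references a docker-assembly.xml descriptor that is not shown in the article. A hypothetical descriptor along the following lines would ship the assembled jar and the docker-config.yml that the container command expects into the image; the file names and layout here are assumptions inferred from the pom, not confirmed contents:
<!-- hypothetical docker-assembly.xml: copies the runnable jar and its config into the image -->
<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2">
  <id>docker</id>
  <files>
    <file>
      <source>target/ShopFloorUI-jar-with-dependencies.jar</source>
      <outputDirectory>/</outputDirectory>
    </file>
    <file>
      <source>src/main/resources/docker-config.yml</source>
      <outputDirectory>/</outputDirectory>
    </file>
  </files>
</assembly>
With the pom above, a typical local build-and-publish flow, previewing steps 2 and 3 below, might look like this (a sketch: docker:build and docker:push are goals of the jolokia docker-maven-plugin, and the repository name follows the pom's docker.repository.name property):
mvn clean package docker:build
docker login
docker push vvaks/biologicsmanufacturingui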
2. Create a docker container that contains the runnable package and a command to start that package on startup. The docker-maven-plugin automates the creation of a docker container from the runnable produced by the Maven assembly plugin. In order for the plugin to work, docker must be installed and accessible from the Eclipse session.
3. Create an account on https://hub.docker.com/. You will need this account to publish the docker container that you created locally. This is important because the Slider client will attempt to download the docker container from Docker Hub, not from the local repository. This is essential because otherwise it would be necessary to distribute the docker container to every single node in the cluster, since Yarn can decide to start it on any node manager.
4. Create the Slider configuration files:
appConfig.json - This file contains the command that the node manager will execute to start the docker container locally, as well as the command to run periodically to check the health of the container. The example below starts two docker containers, one called MAPUI and another called COMETD.
{
"schema": "http://example.org/specification/v2.0.0",
"metadata": {},
"global": {},
"components": {
"MAPUI": {
"mapui.commandPath": "/usr/bin/docker",
"mapui.options":"-d --net=host",
"mapui.statusCommand":"docker inspect -f {{.State.Running}} ${CONTAINER_ID} | grep true"
},
"COMETD": {
"cometd.commandPath": "/usr/bin/docker",
"cometd.options":"-d --net=host",
"cometd.statusCommand":"docker inspect -f {{.State.Running}} ${CONTAINER_ID} | grep true"
}
}
}
metainfo.json - This file contains the image to download from Docker Hub, as well as the ports the container listens on. The component names must match across all three configuration files.
{
"schemaVersion": "2.1",
"application": {
"name": "MAPUI",
"components": [
{
"name": "MAPUI",
"type": "docker",
"dockerContainers": [
{
"name": "mapui",
"commandPath": "/usr/bin/docker",
"image": "vvaks/mapui",
"ports": [{"containerPort" : "8091", "hostPort" : "8091"}]
}
]
},
{
"name": "COMETD",
"type": "docker",
"dockerContainers": [
{
"name": "cometd",
"commandPath": "/usr/bin/docker",
"image": "vvaks/cometd",
"ports": [{"containerPort" : "8090", "hostPort" : "8090"}]
}
]
}
]
}
}
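Reading appConfig.json and metainfo.json together: the commandPath and options from appConfig.json are combined with the image name from metainfo.json to form the command each node manager runs. The net effect is roughly the following (an illustration only; the exact arguments and container naming are assembled by Slider at launch time):
/usr/bin/docker run -d --net=host vvaks/mapui    # MAPUI component
/usr/bin/docker run -d --net=host vvaks/cometd   # COMETD component
The statusCommand from appConfig.json is then executed periodically against each container id to verify that {{.State.Running}} reports true.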
resources.json - This file contains the resources required by the application. Slider will use these specifications to request the required resources from Yarn. The component names must match across all three configuration files.
{
"schema": "http://example.org/specification/v2.0.0",
"metadata": { },
"global": { },
"components": {"slider-appmaster": { },
"MAPUI": {
"yarn.role.priority": "1",
"yarn.component.instances": "1",
"yarn.memory": "256"
},
"COMETD": {
"yarn.role.priority": "2",
"yarn.component.instances": "1",
"yarn.memory": "256"
}
}
}
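Because yarn.component.instances controls how many container instances Yarn runs for each component, a deployed application can later be scaled and managed with standard Slider CLI commands. A sketch, using the instance name mapui created in step 5 below:
slider flex mapui --component MAPUI 2   # scale the MAPUI component to two instances
slider status mapui                     # check the state of the application instance
slider stop mapui                       # stop all containers
slider start mapui                      # start them again
slider destroy mapui                    # remove the instance definition entirely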
5. Make sure that a Slider client is available on the host from which you will launch the request and that it is configured to point at the target Yarn cluster's Resource Manager.
slider create mapui --template /home/docker/dockerbuild/mapui/appConfig.json --metainfo /home/docker/dockerbuild/mapui/metainfo.json --resources /home/docker/dockerbuild/mapui/resources.json
Slider will reach out to Yarn, request the containers specified in resources.json, and then instruct Yarn to run the command specified in appConfig.json with the details specified in metainfo.json. At this point you should see the application listed as a Slider-type application in the Yarn Resource Manager UI. You should be able to click on the application link and view the logs generated by the containers as the application starts up. Of course, Docker must be installed and running on the nodes that make up the queue where Slider will request the application to start.
It should be noted that this approach does not solve all of the problems that a PaaS does. The issue of an application instance registry still has to be dealt with: there is no out-of-the-box approach that allows discovery of, and routing of clients to, the application after it starts or upon container failure. The following link addresses how to deal with this issue: https://slider.incubator.apache.org/design/registry/a_YARN_service_registry.html
All of these issues will be addressed by the Yarn.Next initiative, which the HDP engineering team is hard at work making happen. Yarn.Next will embed all of the capabilities described above in core Yarn. This will allow the creation of a modern data application, including all components like Storm, HBase, and the application tier, simply by providing Yarn with a JSON descriptor. The application will start with all of the required components pre-integrated and discoverable via standard DNS resolution. Stay tuned for the next installment.
For working examples, check out these repos. Each is a working example of a modern data application running on the Hortonworks Connected Platform, including the application tier.
https://community.hortonworks.com/content/repo/27236/credit-fraud-prevention-demo.html
https://community.hortonworks.com/content/repo/29196/biologics-manufacturing-optimization-demo.html
https://community.hortonworks.com/content/repo/26288/telecom-predictive-maintenance.html