Member since: 02-23-2016
Posts: 51
Kudos Received: 96
Solutions: 4
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1211 | 05-25-2016 04:42 PM
 | 2200 | 05-16-2016 01:09 PM
 | 802 | 04-27-2016 05:40 PM
 | 3447 | 02-26-2016 02:14 PM
03-29-2018
01:49 PM
15 Kudos
An article on the challenges and solutions to predicting machine failures in the field. The full details can be found here: https://github.com/kirkhas/zeppelin-notebooks/tree/master/Preventive_maintenance
Step #1 Feature Selection
Step #2 Geolocation
Step #3
- Scythe is a time-series library authored by Kirk Haslbeck for these purposes
- Needed to resample the data into trips or route segments (Scythe Resample)
- Needed to step-interpolate the miles since last service to be 4K, 5K and less continuous regression
Step #4
- Indexing and OneHotEncoding to the rescue. Found a relationship of a particular "Make" that was more problematic than most.
ROC Curve - A near perfect model
- Find more articles tagged with:
- Data Science & Advanced Analytics
- data-science
- How-ToTutorial
- maintenance
- preventive
- Spark
- zeppelin-notebook
03-29-2018
01:39 PM
Repo Description
Preventive Maintenance Data Science Use-Case for Fleet Cost Avoidance.
Repo Info
Github Repo URL: https://github.com/kirkhas/zeppelin-notebooks/tree/master/Preventive_maintenance
Github account name: kirkhas
Repo name: Preventive_maintenance
- Find more articles tagged with:
- Data Science & Advanced Analytics
- data-science
- maintenance
- preventive
- Spark
- zeppelin-notebook
01-31-2017
04:11 PM
1 Kudo
Updating this thread: Hive has primary and foreign keys for metadata and query optimization. https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive
ALTER TABLE TABLENAME ADD CONSTRAINT COLNAME_PK PRIMARY KEY (CS_ID);
ALTER TABLE TABLENAME ADD CONSTRAINT COLNAME_FK1 FOREIGN KEY (TBL_ID) REFERENCES TBLS
01-27-2017
06:39 PM
Is there another method or workaround that can replace the "transform" method, or a suggested usage that resolves the error below?
select transform(host, ip) using 'python parse_mro.py' as (host string, ip string) from table1;
Error: Error while processing statement: FAILED: Hive Internal Error: org.apache.hadoop.hive.ql.security.authorization.plugin.HiveAccessControlException(Query with transform clause is disallowed in current configuration.) (state=08S01,code=12)
Labels: Apache Hive, Apache Ranger
11-18-2016
02:19 PM
What about Ranger: can it provide protection at this level? And assuming data does get removed, are there any recovery options?
11-18-2016
01:23 PM
What are the best recovery options if a product like Ab Initio runs an m_rm command that deletes the HDFS data in one of the environments? These types of low-level executions bypass the Hadoop dfs rm command, which puts deleted data in the trash folder for recovery. The Trash Interval is configured for 21 days in the Hortonworks environment. The data had to be recreated from the source files, but if this were prod, what would the best recovery options be?
Labels: Apache Hadoop
09-14-2016
01:33 PM
1 Kudo
When running Hive 1.2.1 on HDP 2.4, Hive successfully connects to the metastore and then later drops the connection. It seems an exception is thrown after it successfully connects to the metastore. I noticed that if we turn off the CBO settings, it bypasses the metastore and skips this exception. We are using ORC and have run compute stats.
2016-09-07 04:44:08,643 INFO [main]: hive.metastore (HiveMetaStoreClient.java:isCompatibleWith(296)) - Mestastore configuration hive.metastore.filter.hook changed from org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl to org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook
2016-09-07 04:44:08,647 INFO [main]: hive.metastore (HiveMetaStoreClient.java:open(382)) - Trying to connect to metastore with URI
INFO [main]: hive.metastore (HiveMetaStoreClient.java:open(478)) - Connected to metastore.
--
2016-09-07 04:44:08,647 INFO [main]: hive.metastore (HiveMetaStoreClient.java:open(382)) - Trying to connect to metastore with URI
2016-09-07 04:44:08,649 INFO [main]: hive.metastore (HiveMetaStoreClient.java:open(478)) - Connected to metastore.
2016-09-07 04:44:08,664 INFO [main]: Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1173)) - mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
2016-09-07 04:44:08,729 WARN [main]: metastore.RetryingMetaStoreClient (RetryingMetaStoreClient.java:invoke(184)) - MetaStoreClient lost connection. Attempting to reconnect.
org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_aggr_stats_for(ThriftHiveMetastore.java:3033)
2016-09-07 04:44:13,784 WARN [main]: metastore.RetryingMetaStoreClient (RetryingMetaStoreClient.java:invoke(184)) - MetaStoreClient lost connection. Attempting to reconnect.
org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
Labels: Apache Hive
09-13-2016
02:50 PM
4 Kudos
How will Spark designate resources in Spark 1.6.1+ when using num-executors? This question comes up a lot, so I wanted to use a baseline example: an 8-node cluster (2 name nodes, 1 edge node, 5 worker nodes), with each worker node having 20 cores and 256G. If num-executors = 5, will you get 5 total executors or 5 on each node? Table below for illustration.
cores | executors per node | executors total
---|---|---
25 | 5 | 25
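As a rough sketch of the arithmetic behind this question: on YARN, `--num-executors` asks for a total across the whole application rather than a count per node, so the request is capped by cluster capacity instead of being multiplied by the number of workers. The helper below is hypothetical, assumes 4 cores per executor (a value not given in the post), and ignores memory as well as cores reserved for the OS and YARN daemons.

```python
def executors_granted(num_executors, worker_nodes, cores_per_node, executor_cores):
    """Estimate the total executors an application receives, counting CPU only."""
    # How many executors fit on a single worker node.
    per_node_capacity = cores_per_node // executor_cores
    # Upper bound for the whole cluster.
    cluster_capacity = per_node_capacity * worker_nodes
    # The request is cluster-wide, capped by what the cluster can hold.
    return min(num_executors, cluster_capacity)


# 5 worker nodes with 20 cores each, as in the question; executor_cores=4 is an assumption.
granted = executors_granted(num_executors=5, worker_nodes=5,
                            cores_per_node=20, executor_cores=4)
print(granted)
```

Under these assumptions the five executors are spread across the cluster by the resource manager; they are not created five times over, once per node.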
Labels: Apache Spark
08-29-2016
06:43 PM
Can it run both Spark 1.6.1 and Spark 2.0 or just Spark 2.0 ?
08-24-2016
08:04 PM
7 Kudos
Brandon Wilson has a great article that shows how to use the "CACHE TABLE" command in Tableau. However, more recent drivers have come out, and you can now connect directly to the Thrift server using a spark-sql driver. This example uses HDP 2.5 and SimbaSparkOdbc.
First, pull up a Tableau connection and select the Thrift server. I also had to open port 10015 in VirtualBox.
Next, if you don't have the driver, Tableau will take you to a page where you can download a spark-sql driver; inside that package, choose this driver.
Once you establish a valid connection, you will see Tableau flag the connections based on the driver. Below you will see the Hive connection from Brandon's article and now the new Spark connection.
Next, using the CACHE command, enter the below into Tableau's initial SQL box.
Finally, check the Spark storage for the warehouse/crimes table in memory, or any table of your choosing for that matter.
Some visuals from Tableau.
- Find more articles tagged with:
- Data Science & Advanced Analytics
- How-ToTutorial
- ingestion
- odbc
- Spark
- spark-sql
- SparkSQL
- tableau
- thrift
07-09-2016
01:29 AM
1 Kudo
Repo Description
This Zeppelin dashboard demonstrates how to map all the data types from HAWQ to Hive using Sqoop. It uses PostgreSQL to create the HAWQ table and fills in 1 column for every data type. The more significant piece shown here is how to map the data types that differ between HAWQ and Hive. For example, a boolean column in HAWQ exports as t or f, but that is not compatible with Hive. Using PostgreSQL and Sqoop, this converts to TRUE and FALSE, which is accepted by Hive.
Repo Info
Github Repo URL: https://github.com/kirkhas/zeppelin-notebooks/tree/master/HAWQ-Sqoop
Github account name: kirkhas/zeppelin-notebooks/tree/master
Repo name: HAWQ-Sqoop
- Find more articles tagged with:
- ambari-extensions
- Data Ingestion & Streaming
- hawq
- Hive
- Sqoop
07-06-2016
08:39 PM
3 Kudos
Repo Description
Why create yet another VaR example? To demonstrate VaR running on a modern architecture that has no vertical limit. This is a functional, immutable, scalable interpretation of a basic technique commonly used in finance.
zephub link: https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2tpcmtoYXMvemVwcGVsaW4tbm90ZWJvb2tzL21hc3Rlci9Nb250ZUNhcmxvVmFyL25vdGUuanNvbg
Repo Info
Github Repo URL: https://github.com/kirkhas/zeppelin-notebooks/tree/master/MonteCarloVar
Github account name: kirkhas/zeppelin-notebooks/tree/master
Repo name: MonteCarloVar
- Find more articles tagged with:
- ambari-extensions
- Data Science & Advanced Analytics
- data-science
- finance
- Spark
07-06-2016
04:24 PM
4 Kudos
Query JSON using Spark
Imagine you are ingesting JSON msgs but each one has different tag names or even a different structure. This is very common because JSON is a flexible nested structure. However, we commonly interact with data in a flat, table-like structure using SQL. The decision becomes whether to parse the dynamic data into a physical schema (on write) or apply a schema at runtime (on read). Ultimately the decision will likely be made based on the number of writes vs reads. However, there is one major advantage to using Spark to apply schema on read to JSON events: it removes the parsing step. Typically you have to hand-code all the tags in the JSON msgs and map each one to a schema column. This may require meeting with upstream teams or third parties to get the DDL/xsd or schema definition. It also doesn't protect you from msgs you haven't seen or new tags being added to existing JSON structures. Spark's schema on read handles all of this and also flattens the structure into a SQL-queryable table.
In the example below there are 3 different JSON msgs, each with different tags and structures. If the goal is to normalize the data for a specific reporting or data science task, you may be better off defining a physical schema where items like price and strikePrice are converged to a common column that makes sense in both contexts. However, if your goal is to process or serve msgs like a msg bus, or if you find that it is better to query stocks separately from options because the attributes should not be interpreted and you do not want to become the author of the data you are processing, then this could be an ideal approach. (A non-authoritative, low-maintenance approach that is queryable.)
{"tradeId":"123", "assetClass":"stock", "transType":"buy", "price":"22.34",
"stockAttributes":{
"5avg":"20.12","52weekHi":"27.56"
}
}
{"tradeId":"456", "assetClass":"future", "transType":"sell", "strikePrice":"40.00",
"contractType": "forward",
"account":{
"city":"Columbus","state":"Ohio", "zip":"21000"
}
}
{"tradeId":"789", "assetClass":"option", "transType":"buy", "strikePrice":"35.75",
"account":{
"accountType":"retail","city":"Columbus","state":"Ohio"
}
}
1.0 The below image shows the 3 different JSON msgs (stock,option,future) with different attributes and structures.
2.0 Here you can query all of the data or any segment of the data using SQL.
Full code on zephub - code link
Pros:
- Data tags and structure are always in sync with provider
- No data loss
- No parsing layer (code effort), faster time to market
- No authoring, naming or defining columns
Cons:
- SQL reads will be slower than a physically flattened and written table
- Deserialization cost and can't benefit from modern day columnar operations
- Compression - "don't use JSON" video from summit https://www.youtube.com/watch?v=tB28rPTvRiI&feature=youtu.be&t=20m3s
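To make the schema-on-read idea concrete without a cluster, here is a minimal sketch in plain Python that mimics the effect on the three msgs from this article: nested tags are flattened into dotted column names, the table's columns are the union of everything seen, and tags missing from a msg become nulls. This only illustrates the concept; it is not how Spark infers schemas internally, and the `flatten` helper is a hypothetical name.

```python
import json

MESSAGES = [
    '{"tradeId":"123", "assetClass":"stock", "transType":"buy", "price":"22.34",'
    ' "stockAttributes":{"5avg":"20.12","52weekHi":"27.56"}}',
    '{"tradeId":"456", "assetClass":"future", "transType":"sell", "strikePrice":"40.00",'
    ' "contractType":"forward",'
    ' "account":{"city":"Columbus","state":"Ohio","zip":"21000"}}',
    '{"tradeId":"789", "assetClass":"option", "transType":"buy", "strikePrice":"35.75",'
    ' "account":{"accountType":"retail","city":"Columbus","state":"Ohio"}}',
]


def flatten(record, prefix=""):
    """Turn nested dicts into one flat dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


rows = [flatten(json.loads(msg)) for msg in MESSAGES]
# The "schema" is simply the union of every column seen in any msg.
columns = sorted({column for row in rows for column in row})
# Every row gets every column; tags a msg did not carry become None.
table = [{column: row.get(column) for column in columns} for row in rows]

# Equivalent in spirit to: SELECT tradeId FROM trades WHERE transType = 'buy'
buys = [row["tradeId"] for row in table if row["transType"] == "buy"]
print(columns)
print(buys)
```

A new tag in a future msg would simply show up as one more column, with no change to the parsing code.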
- Find more articles tagged with:
- Data Processing
- flatten
- Hive
- How-ToTutorial
- json
- Spark
- spark-sql
06-10-2016
10:45 PM
16 Kudos
Predict Stock Portfolio Gains Using Monte Carlo
Why?
Why create yet another VaR example? To demonstrate VaR running on a modern architecture that has no vertical limit. This is a functional, immutable, scalable interpretation of a basic technique commonly used in finance. Code is available here and on GitHub: https://github.com/kirkhas/zeppelin-notebooks/
Link to Vlad's article for the history of Monte Carlo and VaR: https://community.hortonworks.com/articles/36321/predicting-stock-portfolio-losses-using-monte-carl.html
Some modifications from the original posting include:
- Scala calls the Yahoo API directly, alleviating the need for shell scripting and adding interoperability between variables.
- All data is loaded dynamically in memory, removing the need to store files (which inherently adds manual customizations to a generic process).
- Code is all in Zeppelin for readability.
- Visualizations are in Zeppelin.
- Inputs are built in using Zeppelin forms so the user can interact with the model.
- Percentiles are shown not only for what's at risk each day but also for the final portfolio value.
Figure 1.0 shows the risk you would take on per each day holding these 3 stocks.
Figure 2.0 shows what a reasonable projected outcome might be after holding this position for 100 days.
Check out the code; it has a lot more visuals. Key takeaway: "You should have purchased shares of HDP in mid Feb 2016!"
Code View
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2tpcmtoYXMvemVwcGVsaW4tbm90ZWJvb2tzL21hc3Rlci9Nb250ZUNhcmxvVmFyL25vdGUuanNvbg
Report View
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2tpcmtoYXMvemVwcGVsaW4tbm90ZWJvb2tzL21hc3Rlci9Nb250ZUNhcmxvVmFyL1JlcG9ydFZpZXcvbm90ZS5qc29u
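For readers who want the gist before opening the notebook, the sketch below shows the general shape of a Monte Carlo projection with percentiles. It is not the notebook's code: it assumes daily returns are independent and normally distributed, and the starting value, mean, volatility and trial count are made-up illustration values rather than figures derived from market data.

```python
import random


def simulate_final_values(start_value, daily_mean, daily_stdev, days, trials, seed=7):
    """Simulate `trials` random price paths and return the sorted final values."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same paths
    finals = []
    for _ in range(trials):
        value = start_value
        for _ in range(days):
            # Apply one random daily return to the running portfolio value.
            value *= 1.0 + rng.gauss(daily_mean, daily_stdev)
        finals.append(value)
    return sorted(finals)


def percentile(sorted_values, pct):
    """Nearest-rank style percentile on an already sorted list."""
    index = round(pct / 100.0 * (len(sorted_values) - 1))
    return sorted_values[index]


START = 10_000.0  # hypothetical portfolio value
finals = simulate_final_values(START, daily_mean=0.0005, daily_stdev=0.02,
                               days=100, trials=2_000)
p5, p50, p95 = (percentile(finals, p) for p in (5, 50, 95))
# Value at risk at 95% confidence: the loss at the 5th percentile outcome.
value_at_risk = START - p5
print(p5, p50, p95, value_at_risk)
```

Each trial is independent of the others, which is why the technique spreads naturally across a cluster: more trials just means more parallel tasks.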
- Find more articles tagged with:
- Data Science & Advanced Analytics
- data-science
- finance
- How-ToTutorial
- Spark
- spark-1.6.0
- spark-sql
- zeppelin
05-26-2016
07:41 PM
1 Kudo
Thanks all @Artem Ervits @Tom McCuch for the comments. I did get it resolved by passing all the S3 jars properly on the classpath. The articles included in your threads helped.
05-26-2016
01:02 PM
Unable to execute queries on S3 data using Spark and PySpark. It throws the error below.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
…. ….
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
We tried adding the parameter below, but no luck.
Parameter name: fs.s3a.impl
Parameter value: org.apache.hadoop.fs.s3a.S3AFileSystem
We added this parameter in hdfs-site.xml, core-site.xml and hive-site.xml, and also added the AWS jar files to the classpath in mapred-site.xml.
Labels: Apache Spark
05-25-2016
04:42 PM
The final resolution was this: Ambari was only showing a "SERVER ERROR" msg on the final step, with no stack trace. After reading the log, I saw there was a primary key constraint on the table "clusterservices". I removed this row from the table, then re-installed via Ambari, and it was successful. My hunch is that we got into this state by first trying to remove or edit a service that was already running.
05-25-2016
03:15 PM
@Artem Ervits looks like you were right. Once I got hold of the logs, it looks like they did not stop the service first.
20 May 2016 11:05:21,732 INFO [ambari-heartbeat-processor-0] HeartbeatProcessor:603 - State of service component NODEMANAGER of service YARN of cluster HDPPOC2 has changed from UNKNOWN to STARTED at host ip-10-228-210-131 according to STATUS_COMMAND report
20 May 2016 11:05:33,535 ERROR [qtp-ambari-client-41] AbstractResourceProvider:338 - Caught AmbariException when modifying a resource
org.apache.ambari.server.AmbariException: Cannot remove ZEPPELIN. Desired state STARTED is not removable. Service must be stopped or disabled.
at org.apache.ambari.server.controller.internal.ServiceResourceProvider.deleteServices(ServiceResourceProvider.java:869)
at org.apache.ambari.server.controller.internal.ServiceResourceProvider$3.invoke(ServiceResourceProvider.java:247)
at org.apache.ambari.server.controller.internal.ServiceResourceProvider$3.invoke(ServiceResourceProvider.java:244)
at org.apache.ambari.server.controller.internal.AbstractResourceProvider.invokeWithRetry(AbstractResourceProvider.java:450)
20 May 2016 11:06:52,501 ERROR [qtp-ambari-client-42] AmbariJpaLocalTxnInterceptor:180 - [DETAILED ERROR] Rollback reason:
Local Exception Stack:
Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.6.2.v20151217-774c696): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: org.postgresql.util.PSQLException: ERROR: update or delete on table "servicecomponentdesiredstate" violates foreign key constraint "hstcmpnntdesiredstatecmpnntnme" on table "hostcomponentdesiredstate"
Detail: Key (component_name,cluster_id,service_name)=(ZEPPELIN_MASTER,2,ZEPPELIN) is still referenced from table "hostcomponentdesiredstate".
Error Code: 0
Call: DELETE FROM servicecomponentdesiredstate WHERE (((cluster_id = ?) AND (component_name = ?)) AND (service_name = ?))
bind => [3 parameters bound]
at org.eclipse.persistence.exceptions.DatabaseException.sqlException(DatabaseException.java:340)
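The log above says the service must be stopped before it can be removed. As a hedged sketch of that first step: to my knowledge, the Ambari REST API stops a service with a PUT that sets its desired state to INSTALLED, after which the DELETE can be issued. The code below only builds the request and never sends it; the host name is a placeholder, authentication is left out, and the exact payload should be checked against the Ambari version in use.

```python
import json
import urllib.request


def build_stop_service_request(base_url, cluster, service):
    """Build (but do not send) a PUT asking Ambari to stop a service."""
    url = f"{base_url}/api/v1/clusters/{cluster}/services/{service}"
    body = {
        "RequestInfo": {"context": f"Stop {service}"},
        # INSTALLED is the state Ambari uses for "installed but not running".
        "Body": {"ServiceInfo": {"state": "INSTALLED"}},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"X-Requested-By": "ambari"},
    )


# Placeholder host; cluster and service names are the ones from the log.
request = build_stop_service_request("http://ambari.example.com:8080",
                                     "HDPPOC2", "ZEPPELIN")
print(request.get_method(), request.full_url)
```

Sending it would be a matter of adding credentials and calling `urllib.request.urlopen(request)`, then waiting for the stop to finish before deleting the service.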
05-23-2016
03:35 PM
Removed all components related to Zeppelin from Ambari and tried to reinstall, but every time it fails with the error "Server error." I used the command below to remove the service. Is there something else that requires cleaning up?
curl -u admin:admin -X DELETE -H 'X-Requested-By:1' http://10.228.210.175:80/api/v1/clusters/HDPPOC2/services/ZEPPELIN
Labels: Apache Ambari, Apache Zeppelin
05-16-2016
01:09 PM
5 Kudos
Results from using sqoop to move data from HAWQ to HIVE. @Artem Ervits and @cstanca
HAWQ | Hive | Result
---|---|---
int | int | worked
text | string | worked
date | string | write=string, onRead date operations work
timestamp | string | write=string, onRead ts operations work
bit | boolean | conversion does not work
decimal | double | mostly works, precision loss > 9
double precision | double | works
real | double | works
interval | | Breaks! sqoop mapping error
bit varying | | Breaks! sqoop mapping error
time | string | write=string, onRead time operations work
char | string | write=string, onRead you need wildcard expression, recommend trimming
char varying | string | write=string, onRead holds whitespace, recommend trimming
varchar | string | works
boolean | boolean | works
numeric | double | works
%sh
sqoop import --username zeppelin --password zeppelin --connect jdbc:postgresql://jdbcurl --query 'SELECT id,name,join_date,age,a,b,i FROM kirk WHERE $CONDITIONS' -m 1 --target-dir /user/zeppelin/kirk/t6 --map-column-java a=String,i=String,b=String
-- select *
select * from kirk ;
-- int check between inclusive
select age from kirk where age between 25 and 27;
-- decimal check
select dec from kirk where dec > 33.32;
-- string like and wildcard
select address from kirk where address like '%Rich%';
-- date is a string but operates like date
select join_date from kirk where join_date between '2007-12-13' and '2007-12-15';
-- timestamp, works string on write but operates like TS
select ts from kirk where ts > '2016-02-22 08:01:22';
-- BIT NOT CORRECT
select a from kirk where a = false or a = 1;
-- character varying, without white space matches
select cv from kirk where cv = 'sdfsadf';
-- character varying, with white space
select cv from kirk where cv = 'white space'; -- not matching
select cv from kirk where cv = 'white space '; -- matching
-- character, doesn't match unless wildcard
select c from kirk where c like 'we%';
-- boolean, both true/false and 1/0 are converted properly
select id, isactive from kirk where isactive = true or isactive = 0;
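The char and char varying results above come down to trailing whitespace surviving the move into a Hive string. The short sketch below reproduces the effect in plain Python; the column width of 10 and the helper names are assumptions for illustration, not values taken from the HAWQ table.

```python
def matches_exact(stored, wanted):
    """Plain equality, as in: WHERE c = 'we'."""
    return stored == wanted


def matches_trimmed(stored, wanted):
    """Equality after removing trailing whitespace from both sides."""
    return stored.rstrip() == wanted.rstrip()


# A fixed-width char column pads short values with spaces (width 10 assumed here).
padded_char = "we".ljust(10)
# A char varying value that was stored with one trailing space.
varchar_with_space = "white space "

print(matches_exact(padded_char, "we"))           # padding breaks the exact match
print(matches_trimmed(padded_char, "we"))
print(matches_exact(varchar_with_space, "white space"))
print(matches_trimmed(varchar_with_space, "white space"))
```

Trimming once during ingestion keeps later queries simple, since plain equality works and no wildcard expression is needed.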
05-12-2016
05:19 PM
1 Kudo
Great feature @bbende . This is much easier to visualize now.
05-12-2016
05:12 PM
3 Kudos
In the case where NiFi is reading from 30 database tables in a single flow, what is the best way to visually identify which processor is connecting to each database and table?
Labels: Apache NiFi
05-11-2016
01:55 PM
7 Kudos
You can now visualize any Zeppelin notebook using the ZeppelinHub viewer: https://www.zeppelinhub.com/viewer
Personal likes:
1. No need to sign up or register; just paste a link.
2. I've been posting my Zeppelin notebooks to GitHub, but everyone who wants to visualize or interact with them needs to download them, move them to their environment, and import them into their instance of Zeppelin. Not anymore; just paste the link.
3. Less of a need to take screenshots and create a PowerPoint; just send the hyperlink.
Examples:
Stock Variance Notebook on GitHub - https://github.com/kirkhas/zeppelin-notebooks/blob/master/stock-variance/note.json
vs
Stock Variance Notebook on zephub - https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2tpcmtoYXMvemVwcGVsaW4tbm90ZWJvb2tzL21hc3Rlci9zdG9jay12YXJpYW5jZS9ub3RlLmpzb24
Credit Card Fraud Transactions on GitHub - https://github.com/vakshorton/CreditCardTransactionMonitor/blob/master/Zeppelin/notebook/2BGDWYZV9/note.json
vs
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL3Zha3Nob3J0b24vQ3JlZGl0Q2FyZFRyYW5zYWN0aW9uTW9uaXRvci9tYXN0ZXIvWmVwcGVsaW4vbm90ZWJvb2svMkJHRFdZWlY5L25vdGUuanNvbg
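Judging only from the links in this post, a viewer link appears to be the viewer base URL followed by the Base64 encoding of the notebook's raw GitHub URL, with the trailing padding removed. The sketch below rebuilds the Stock Variance link that way. The pattern is inferred from these examples rather than from ZeppelinHub documentation, and both the helper names and the choice of the URL-safe Base64 alphabet are assumptions.

```python
import base64

VIEWER_BASE = "https://www.zeppelinhub.com/viewer/notebooks/"


def to_raw_url(github_blob_url):
    """Convert a github.com .../blob/... page URL into its raw file URL."""
    return (github_blob_url
            .replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
            .replace("/blob/", "/", 1))


def to_viewer_url(github_blob_url):
    """Build a viewer link by Base64-encoding the raw URL without '=' padding."""
    raw = to_raw_url(github_blob_url).encode("utf-8")
    token = base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")
    return VIEWER_BASE + token


link = to_viewer_url(
    "https://github.com/kirkhas/zeppelin-notebooks/blob/master/stock-variance/note.json"
)
print(link)
```

Decoding the token from any of the viewer links above gives back the raw notebook URL, which is a quick way to check where a shared link points.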
- Find more articles tagged with:
- Data Science & Advanced Analytics
- FAQ
- How-ToTutorial
- python
- Spark
- visualization
- zeppelin
- zeppelin-notebook
05-11-2016
01:55 PM
1 Kudo
In addition to the HWX install guides online, this is a great best practices article for groups that want to consider some design options prior to install. http://hortonworks.com/blog/best-practices-in-hdfs-authorization-with-apache-ranger/
- Find more articles tagged with:
- authorization
- best-practices
- HDFS
- how-to-tutorial
- How-ToTutorial
- Ranger
- Security
05-11-2016
01:04 PM
1 Kudo
Repo Description
Zeppelin Notebook Stock Variance Example. Check it out on the ZeppelinHub viewer:
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2tpcmtoYXMvemVwcGVsaW4tbm90ZWJvb2tzL21hc3Rlci9zdG9jay12YXJpYW5jZS9ub3RlLmpzb24
Repo Info
Github Repo URL: https://github.com/kirkhas/zeppelin-notebooks
Github account name: kirkhas
Repo name: zeppelin-notebooks
- Find more articles tagged with:
- Data Science & Advanced Analytics
- Spark
- zeppelin
- zeppelin-notebook
04-30-2016
03:05 AM
Vadim and I at Hortonworks have a packaged demo of exactly what I described above on GitHub: data streaming through NiFi into a Spark model. It is at this link: https://github.com/vakshorton
04-30-2016
03:03 AM
1 Kudo
The more traditional approach in this situation is to use NiFi to read the incoming data and then add a NiFi processor to dump the data from the NiFi queue to either Storm or, in your case, Spark Streaming. Now you can build a Spark ML model, test it, and run it in Spark. You can push logic into NiFi, but an ML model inside NiFi is overkill.
04-27-2016
06:18 PM
1 Kudo
In the latest version of the Hortonworks Sandbox 2.4 (free to download from hortonworks.com), Zeppelin and Spark run out of the box. The Spark version is 1.6 and the toJSON method works; make sure you run it on a DataFrame, not an RDD:
val js = tradesRDD.toDF.toJSON
js.take(2)
output:
Array[String] =
Array({"trader":"Kirk","price":11.0,"qty":51,"vol":40000,"product":"goog","time":"2016-03-29 10:38:12.0"},
{"trader":"Kirk","price":0.0,"qty":66,"vol":40000,"product":"goog","time":"2016-03-29 10:56:12.0"})
04-27-2016
05:40 PM
2 Kudos
This is a common process that many go through, and there are many ways to skin the cat here. I prefer the methodology below.
1. Bring in the data with minimal transformation: the "E" and "L". Depending on workload, this could be Sqoop for simple batch or NiFi for a more modern streaming approach with better control over flow, bi-direction and back pressure.
2. Decide on a transformation strategy and store a higher-level or "enriched" data set, typically in Hive or HBase. Now, between Atlas and NiFi, you should have some data lineage. Other formatting might take place here, such as native datatypes (dates vs timestamps). A partitioning strategy would likely be applied here too. Running a data cleansing strategy at this phase is also a good idea, as is computing feature vectors.
3. Use Zeppelin + Spark to analyze the data.
04-27-2016
03:07 PM
1 Kudo
After playing with the Spark 1.6 LinearRegression model, I found it is very sensitive to the step size. What is the best practice for tuning this parameter? The mean squared error of the model I build varies greatly depending on this input.
import java.text.NumberFormat
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
// Building the model (trainingDataRDD is an RDD of LabeledPoint)
val numIterations = 30
val stepSize = 0.0001
val linearModel = LinearRegressionWithSGD.train(trainingDataRDD, numIterations, stepSize)
// Evaluate model on training examples and compute training error
val valuesAndPreds = trainingDataRDD.map { point =>
val prediction = linearModel.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("training Mean Squared Error = " + NumberFormat.getInstance().format(MSE))
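The sensitivity described here can be reproduced without Spark. The sketch below uses plain full-batch gradient descent on a single unscaled feature; it is not MLlib's implementation, and the data (y = 3x + 5 for x from 0 to 99) and the three step sizes are made up for illustration. With feature values this large, a step that is slightly too big makes the error explode, while one that is too small barely moves in 30 iterations.

```python
def train(xs, ys, step_size, num_iterations):
    """Fit y ~ w * x + b with full-batch gradient descent; returns (w, b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(num_iterations):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= step_size * grad_w
        b -= step_size * grad_b
    return w, b


def mean_squared_error(xs, ys, w, b):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [float(x) for x in range(100)]
ys = [3.0 * x + 5.0 for x in xs]

results = {}
for step_size in (1e-6, 1e-4, 1e-2):
    w, b = train(xs, ys, step_size, num_iterations=30)
    results[step_size] = mean_squared_error(xs, ys, w, b)
    print(step_size, results[step_size])
```

Because the usable range of step sizes depends on the scale of the features, standardizing features before training is a common way to make the model less touchy about this parameter; whether that resolves the behaviour seen with the data in this post is untested.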
Labels: Apache Spark, Apache Zeppelin