Member since: 09-23-2015
Posts: 800
Kudos Received: 897
Solutions: 185
My Accepted Solutions
Views | Posted |
---|---|
3991 | 08-12-2016 01:02 PM |
1872 | 08-08-2016 10:00 AM |
2039 | 08-03-2016 04:44 PM |
4328 | 08-03-2016 02:53 PM |
1132 | 08-01-2016 02:38 PM |
05-18-2016
05:14 PM
1 Kudo
It's Hadoop; you can use whatever you want and are most comfortable with. Functionally, Spark, Pig, and Hive are equivalent, and performance is also very close (if you use Tez for Pig and Hive). Complex queries will do much better in Hive; transformations that require a lot of distinct steps with a lot of data kept in memory are a strong suit of Spark. But all in all it depends more on what you are comfortable with and what kind of data prep you want to do. Lots of people know SQL, so they should use Hive. Lots of people like Pig because it is well integrated with Oozie and very mature, and it is also really easy to write UDFs for. Spark is a bit less stable and mature but has a ton of add-ons, and you can rapidly program functions in Scala or Python if you are so inclined. All of them can read and write Hive tables, and read from and write to unstructured files (Hive being better at the former, Pig/Spark better at the latter). Choose your poison.
05-18-2016
05:07 PM
2 Kudos
Yup, possible. Remove the VALUES, remove the FROM. The syntax is: INSERT INTO TABLE xxx PARTITION ( xxx ) SELECT xxx; You don't need to specify any columns or data types; the select just needs to fit your target table (all columns need to match). The only potential issue is the partition column: if you specify a static partition you cannot have the partition column in the SELECT clause, while with dynamic partitioning you need the partition columns. You mixed up the FROM-query and VALUES forms. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
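To make that concrete, a minimal sketch of both variants; the table and column names (sales, sales_staging, id, amount, ds) are made up for illustration:

  -- Static partition: the partition value is fixed, so it must NOT appear in the SELECT list
  INSERT INTO TABLE sales PARTITION (ds = '2016-05-18')
  SELECT id, amount FROM sales_staging WHERE ds = '2016-05-18';

  -- Dynamic partitioning: the partition column goes LAST in the SELECT list
  SET hive.exec.dynamic.partition.mode=nonstrict;
  INSERT INTO TABLE sales PARTITION (ds)
  SELECT id, amount, ds FROM sales_staging;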
05-18-2016
01:18 PM
1 Kudo
You would use Spark for:
1. Data preparation and aggregation: data preparation, cleansing and aggregation (like you would use Pig/Hive/MapReduce). It is then easiest to save the aggregated table into Hive tables and access them with your analytical tool. As an example: keep all transactions in Hive, crunch the daily transactions into an aggregation table, and export that into Tableau (see the sketch after this list).
2. Advanced analytics: Spark provides advanced analytics like Spark MLlib and GraphX.
3. Analytics on all data: advanced analysts can use Spark directly, for example out of Zeppelin, to run queries on the full dataset. That may not be as comfortable as their usual tool from 1), however you can run the queries on the full data set.
4. Streaming: Spark Streaming covers near-real-time processing of data as it arrives.
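A minimal Scala sketch of scenario 1, assuming a Hive table transactions with columns day and amount (all names here are hypothetical):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  object DailyAggregation {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("daily-agg"))
      val hive = new HiveContext(sc)
      // Crunch the raw transactions into one row per day ...
      val daily = hive.sql(
        "SELECT day, SUM(amount) AS total FROM transactions GROUP BY day")
      // ... and persist the aggregate as a Hive table for the BI tool to pick up.
      daily.write.mode("overwrite").saveAsTable("transactions_daily")
    }
  }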
05-17-2016
11:12 AM
Hello Elan, I am not sure about the question. PAM will authenticate against any Linux user with the Linux password. So is user1/user1 a valid Linux user? You don't need to do anything with the metastore. The user does need a home directory in HDFS, however.
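Creating that home directory is a one-off step, for example as the hdfs superuser (user1 here stands in for your actual user):

  hadoop fs -mkdir -p /user/user1
  hadoop fs -chown user1 /user/user1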
05-16-2016
02:38 PM
2 Kudos
Get data into the cluster? The easiest way is to have a delimited file and do hadoop fs -put file <hdfs location> You can then read those files with sc.textFile. I think you should go through a couple of basic tutorials to get comfortable working with Hadoop: http://hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/
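End to end that could look like this (the file and path names are made up). First upload from the shell:

  hadoop fs -put people.csv /user/me/data/people.csv

then read it in the spark-shell:

  val lines = sc.textFile("/user/me/data/people.csv")
  val rows = lines.map(_.split(","))   // split each delimited line into its fields
  println(rows.count())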
05-16-2016
02:35 PM
One thing is that Hive on Tez is in general significantly faster than Spark SQL; it has had for a long time the basic efficiencies that Spark has only just added (Tungsten ...). However, I see some problems in your configuration too. Hive/Tez can take the whole cluster, which has 30 * 250 GB RAM = 7500 GB, while you only give Spark 40 * 4 GB = 160 GB. And 40 executors on 30 nodes do not make any sense in any case; use 30, 60, 90, or 120. The best practice is to make executors large, but not too large: 16-32 GB is apparently a good size; less results in more inter-process overhead, more results in GC problems. Do you know how many map/reduce tasks you get when the query is run in Tez? To be fair you should give Spark at least the same resources. Also play around with the cores; Spark uses them to decide parallelism. Spark may also have problems with the partitioning/predicate-pushdown features Hive/Tez supports (not sure about the current state of support for these, or whether you use them). Finally, there is the question of optimization: a wrong join type and your query can be 100x slower or faster. The joy of comparing database performance; it is hard to give general tips here.
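As a rough illustration of those sizing rules (the exact numbers depend on your workload, so treat these purely as placeholders):

  # 60 executors = 2 per node on 30 nodes; 24 GB sits in the suggested 16-32 GB range;
  # executor cores determine how many tasks each executor runs in parallel.
  spark-submit \
    --num-executors 60 \
    --executor-memory 24G \
    --executor-cores 4 \
    your-query-job.jar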
05-16-2016
12:46 PM
Sounds good to me:
- Load the data into HDFS (potentially use Pig to fix some formatting issues)
- Load the data into Hive for some pre-analysis and data understanding (perhaps together with Eclipse Data Tools or Ambari views)
- Split the data into test and training datasets (using some randomization function in Hive/Pig, or directly in Spark as sketched below)
- Run the analysis in Spark MLlib (good choice)
- Investigate the results with visualization tools (Zeppelin works nicely); rerun the modelling as needed
- Export the final results into Hive/an external database and use a decent BI tool like Tableau (or, if you want it free, BIRT/Pentaho) to visualize them
Sounds like a very good basic workflow.
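For the test/training split in Spark, a one-liner does it (data is whatever RDD holds your examples):

  // randomSplit partitions the RDD by the given weights; a fixed seed keeps it reproducible
  val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)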
05-16-2016
12:17 PM
2 Kudos
Hello David, does it have to be Mahout? In general Spark MLlib is just quite a bit "cooler" now. Here is its web page with example code (if it has to be Mahout, I am sure someone can help too): http://spark.apache.org/docs/latest/mllib-clustering.html Regarding Mahout, I suppose you found that one already: https://mahout.apache.org/users/clustering/k-means-clustering.html
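Along the lines of the linked MLlib page, a minimal k-means sketch in Scala, assuming a text file of space-separated numeric features (the path is made up):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  // Parse each line into a dense feature vector
  val data = sc.textFile("/user/me/kmeans_data.txt")
  val parsed = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

  // Cluster into 2 groups, with at most 20 iterations
  val model = KMeans.train(parsed, 2, 20)
  println("Within-set sum of squared errors: " + model.computeCost(parsed))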
05-16-2016
11:14 AM
1) Yes, you can see the "Tez session was closed ..." message.
2) In anything after HDP 2, Tez is enabled by default. MapReduce might be going away as an option anyway.
3) You can still set the execution engine per query: set hive.execution.engine=mr or set hive.execution.engine=tez
4) Not sure what you mean by utility. The Tez view in Ambari would provide that functionality; I am not completely sure about the out-of-the-box integration with the ResourceManager.
https://www.youtube.com/watch?v=xyqct59LxLY
05-16-2016
09:53 AM
1 Kudo
Essentially, all column families in a table share the same row key; if you want to use two different keys you need two tables. So I think you should have two tables: one keyed by account|cust, as you say, to find the customer info for an account, and a separate table keyed by cust|account so you can easily drill down to a customer and find all the accounts associated with them. You can also build the second table with cust alone as the key and an array of accounts as the value, as you say, but then you always need to update the whole list of accounts at once. If you key the second table by cust|account you can freely add and delete account rows for a customer, and do a prefix scan to get all accounts.
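In HBase shell terms, the second table could look like this (table, key, and column names are all made up):

  create 'customer_account', 'cf'
  put 'customer_account', 'cust42|acct001', 'cf:status', 'open'
  put 'customer_account', 'cust42|acct002', 'cf:status', 'open'
  # prefix scan: every account row for customer cust42
  scan 'customer_account', {ROWPREFIXFILTER => 'cust42|'}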