Member since: 06-20-2016
488 Posts
433 Kudos Received
118 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3603 | 08-25-2017 03:09 PM |
| | 2506 | 08-22-2017 06:52 PM |
| | 4195 | 08-09-2017 01:10 PM |
| | 8972 | 08-04-2017 02:34 PM |
| | 8946 | 08-01-2017 11:35 AM |
10-03-2016
11:29 PM
1 Kudo
@Fred Schwartz NiFi is ideal for exactly your needs. NiFi is a 100% open source Apache project. NiFi is also packaged in the Hortonworks DataFlow (HDF) platform, where it is bundled with Kafka, Storm, Ambari and Ranger. HDF is fully enterprise multitenant and secure. NiFi is built to pull data from dozens of data sources ranging from relational databases to email, Twitter, local files, S3, HTTP and so on. It has prebuilt connectors to these sources, and flows are developed in an easy-to-configure drag-and-drop way. You can easily build your own connectors, and since the project is open source, new ones are added continuously. In addition to pulling from a number of sources, you can push to diverse targets as well: HDFS, Hive and Kafka are possibilities, as are email, Amazon S3 and many more. Note that HDF works as a great complement to HDP (Hadoop) but does not require it. In between pulling from sources and pushing to targets, NiFi allows you to transform data, route based on content, merge data and perform other mediations. You can get an idea of the data sources you can pull from, the mediations you can make on that data, and the targets you can push to by looking at this list of processors (processors are the basic units you connect into a data flow): https://nifi.apache.org/docs.html Again, one of the great things about NiFi is its easy-to-use UI/configuration approach (screenshot below the answer). HCC has numerous articles on NiFi -- just do a search. Check out:
http://hortonworks.com/apache/nifi/
http://hortonworks.com/blog/hortonworks-dataflow-2-0-ga/
https://nifi.apache.org/docs.html
https://www.youtube.com/watch?v=jctMMHTdTQI
You can download and start using it here: http://hortonworks.com/downloads/#dataflow
10-03-2016
10:45 PM
@Adda Fuentes If you feel the answer satisfies your needs, let me know by accepting it. (Also, you can always add a comment to a post instead of adding an answer -- that is a better way to use the site.) Looking forward to your continued interaction with HCC 🙂
10-03-2016
10:35 PM
2 Kudos
Both are MPP (massively parallel processing) databases designed to query large volumes of data (e.g. petabytes) at relatively fast response times (e.g. seconds), though performance depends heavily on server scaling, query type, database design and table design.

HAWQ is an open source Apache project -- its roadmap is driven by the Apache community, and it can be implemented with the Hortonworks HDP Hadoop platform as well as other Hadoop platforms. You can download HAWQ and implement it on the Hortonworks sandbox (or a full cluster). Presto is Apache-licensed but not an Apache project -- its roadmap is driven by Teradata and not the community.

Presto has the advantage of being able to query data inside and outside HDFS, whereas HAWQ is confined to HDFS or tables built on HDFS, which are optimized using the Parquet file format. For queries against Hadoop, Presto is not natively YARN-enabled, but you can integrate it with YARN via Twill; HAWQ is natively YARN-enabled. HAWQ is 100% PostgreSQL-compliant (e.g. you can implement pgAdmin against it), whereas Presto offers extensive ANSI SQL support but is not 100% compatible. HAWQ is generally faster than Presto. HAWQ has a MADlib data science and machine learning plugin that lets you do complex data science as functions inside your SQL queries and against your database.

Diving deeper to compare the two is a much more complex topic -- for example, they differ significantly in their architecture and scaling strategies. Note that Hive has made great strides in recent years and is approaching HAWQ in its query response times. This is largely due to the focus the Apache community (and Hortonworks) has given to optimizing Hive around the ORC file format, LLAP in-memory caching, and cost-based optimization.

Use the following for general resources and for taking a much deeper dive. You could probably stay up all night discussing this question from a technical deep-dive perspective.

HAWQ
http://hawq.incubator.apache.org/
http://hortonworks.com/apache/hawq/
https://blog.pivotal.io/big-data-pivotal/products/pivotal-hawq-benchmark-demonstrates-up-to-21x-faster-performance-on-hadoop-queries-than-sql-like-solutions

Presto
http://www.teradata.com/products-and-services/presto-download/?pcid=Google_Presto-US-EN-GGL-BMM_paidsearch_%2Bpresto%20%2Bsql&utm_source=Google&utm_campaign=Presto-US-EN-GGL-BMM&utm_medium=paidsearch&utm_term=%2Bpresto%20%2Bsql
http://siliconangle.com/blog/2015/06/09/teradata-adopts-presto-for-hadoop-sql-queries/

Hive
http://hortonworks.com/apache/hive/
https://cwiki.apache.org/confluence/display/Hive/Home
10-03-2016
01:07 PM
1 Kudo
@Robbert
Try this link if you want to explore using the JDBC driver: https://community.hortonworks.com/questions/56113/trying-to-create-phoenix-interpreter-using-jdbc-in.html (Also, if you find my answer provided what you were looking for, let me know by accepting it 🙂)
09-30-2016
05:00 PM
Not sure of your exact question, but typically it is a good idea to compress the output of the map step in your MapReduce jobs. This data is written to disk and then sent across the cluster to the reducers (the shuffle), and the overhead of compressing/decompressing is almost always minimal compared to the large gains from sending significantly lower data volumes over the wire. To set this for all of your jobs, use these configs in mapred-site.xml:

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

You can of course set the first value to false in mapred-site.xml and override it per job (e.g. as a parameter on the command line or set at the top of a Pig script, as sketched below). See this link for details: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_hdfs_admin_tools/content/ch04.html
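For illustration, here is a minimal sketch of the per-job override at the top of a Pig script. It assumes the Snappy codec is available on your cluster and uses the same property names as the config above; the relation names and paths are hypothetical placeholders, not part of the original question.

-- enable compression of map output for this job only
SET mapred.compress.map.output true;
-- use the Snappy codec for the compressed map output
SET mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;

-- hypothetical job: load, group and store some data
logs = LOAD '/tmp/input/logs' USING PigStorage('\t') AS (user:chararray, bytes:long);
by_user = GROUP logs BY user;
totals = FOREACH by_user GENERATE group AS user, SUM(logs.bytes) AS total_bytes;
STORE totals INTO '/tmp/output/totals';

The two SET statements only affect the MapReduce jobs generated by this script; the cluster-wide defaults in mapred-site.xml are left untouched.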
09-30-2016
01:10 PM
1 Kudo
When you click on the Interpreters tab you will see that %jdbc is preconfigured for Postgres. Not sure why this version of the sandbox is set up that way, but that is the way it is. Either try %hive, or try to configure the %jdbc interpreter for Hive, using this as a guide: https://community.hortonworks.com/questions/56113/trying-to-create-phoenix-interpreter-using-jdbc-in.html
09-29-2016
03:55 PM
1 Kudo
Windows-compatible
HDP 2.4.3 is the last version of HDP that runs on Windows. It does not have a GUI to manage the cluster. Install instructions are here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_installing_manually_book/content/meet-min-system-requirements.html HDP 2.3.x is deprecated (and for Windows did not have a GUI to manage the cluster).

Linux only
HDP 2.5+ does not install on Windows.

Microsoft cloud
You can install HDP on the Microsoft Azure cloud platform, and this will install Ambari ... but the images are Linux. http://hortonworks.com/blog/easy-steps-to-create-hadoop-cluster-on-microsoft-azure/
09-29-2016
01:15 PM
Give it a try and let me know what happens. I tested it successfully on the latest sandbox. Sometimes updates to the tutorials are delayed -- the latest sandbox itself came out last week.
09-29-2016
12:47 PM
3 Kudos
Give this a try:
%hive
09-28-2016
07:23 PM
1 Kudo
@Eugene Geis Thank you for the detailed description of your issues. There is a single overarching theme to my answer: your cluster is not properly sized for the processing you are doing. Big data on Hadoop leverages horizontal scaling, and like all data processing it can hit resource constraints under a given implementation.

I had a similar situation the first time I worked on Hadoop. I had 2.53 billion records, each holding 57 columns, that I bulk loaded to HBase. I was on an 8-node cluster, and the first time I bulk loaded to HBase it brought ZooKeeper, HBase and the cluster to their knees and then to a groaning death. The ultimate root cause was that the number of ZooKeeper connections was configured way too low (for the extreme workload I threw at it). I had to reconfigure these and then bulk load in separate chunks as opposed to one shot. Things were still not ideal because HBase major compaction ran for hours afterwards, stressing CPU and memory on all of the nodes. I eventually resized the cluster (added more nodes) to accommodate the load I was throwing at it.

To answer your question, you are throwing too much load at your cluster, given its size. Hadoop is famously robust, but only when properly sized. Regarding your local directories filling up: Pig runs MapReduce jobs under the covers, and intermediate (temporary) data is written to disk between the map and reduce steps. The large amount of intermediate data produced by your triple join of a TB of data is spread among so few nodes in your cluster that it exceeds capacity on some of them.

My suggestion is to start with lower loads on your given cluster and learn how to optimize your jobs. For example, one common optimization is to compress your intermediate data. See this link on optimizing Pig: https://community.hortonworks.com/questions/57449/fine-tune-the-pig-job.html#comment-58059

The next suggestion (after learning to optimize) is to add more data nodes to your cluster to horizontally scale the load. You could simply add nodes and not optimize ... but we always want to optimize to use resources more wisely. See this link for help on sizing your cluster: http://info.hortonworks.com/SizingGuide.html