Member since 06-20-2016

488 Posts | 433 Kudos Received | 118 Solutions
        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3604 | 08-25-2017 03:09 PM |
| | 2512 | 08-22-2017 06:52 PM |
| | 4196 | 08-09-2017 01:10 PM |
| | 8977 | 08-04-2017 02:34 PM |
| | 8949 | 08-01-2017 11:35 AM |
			
    
	
		
		
10-10-2016 11:16 PM

@Bibhas Burman That is an excellent tutorial for pushing log data to HDFS for historical analysis. If you want to do real-time streaming analysis, here are two links that should be useful:

http://hortonworks.com/hadoop-tutorial/realtime-event-processing-nifi-kafka-storm/ (ignore the Storm part)

https://community.hortonworks.com/articles/44550/horses-for-courses-apache-spark-streaming-and-apac.html (integrate with the Kafka part from the first link)

Since you are getting your feet wet with the technology, definitely put in some time to play around with it and build small projects before working toward your end product. And of course, anytime you have a question along the way, ask the HCC for guidance.
						
					
		
10-10-2016 08:27 PM

NiFi is very good at capturing logs. Why not use each technology where it is best: NiFi to gather log data in real time -> Kafka queue -> Spark Streaming analytics -> Zeppelin for Spark and visualization. You could also fork the NiFi flow through MergeContent to HDFS to keep the data for historical analysis. All of these technologies come out of the box with HDF and HDP.
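As a rough illustration of the Spark Streaming leg of that pipeline, here is a minimal PySpark sketch. The topic name "weblogs" and broker address kafka-host:6667 are placeholders, and the job assumes the spark-streaming-kafka package is on the classpath when submitted:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="NifiKafkaLogStream")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Direct stream: read the topic partitions straight from the Kafka brokers
stream = KafkaUtils.createDirectStream(
    ssc, ["weblogs"], {"metadata.broker.list": "kafka-host:6667"})

# Each record is a (key, value) pair from Kafka; as a toy analytic,
# count the ERROR lines arriving in each batch
errors = stream.map(lambda kv: kv[1]).filter(lambda line: "ERROR" in line)
errors.count().pprint()

ssc.start()
ssc.awaitTermination()
```

From there the same results can be written back out or explored and charted in Zeppelin, as described in the links above.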
						
					
		
10-07-2016 01:19 PM

These should be helpful:

http://stackoverflow.com/questions/32080475/how-to-read-a-zip-containing-multiple-files-in-apache-spark

http://stackoverflow.com/questions/28569788/how-to-open-stream-zip-files-through-spark
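The approach in the first link boils down to reading the archives as binary blobs and unpacking them on the executors. A rough PySpark sketch of that idea follows; the HDFS path is a placeholder:

```python
import io
import zipfile
from pyspark import SparkContext

sc = SparkContext(appName="ReadZipsExample")

def unzip_lines(name_and_bytes):
    """Yield the text lines of every entry inside one zip archive."""
    name, content = name_and_bytes
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for entry in zf.namelist():
            for line in zf.read(entry).decode("utf-8").splitlines():
                yield line

# binaryFiles returns (path, bytes) pairs -- one per zip file matching the path
lines = sc.binaryFiles("hdfs:///data/zips/*.zip").flatMap(unzip_lines)
print(lines.take(5))
```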
						
					
		
10-07-2016 11:45 AM
1 Kudo

If you are using HDP, all of the tools discussed below are deployed when you install the distribution.

Store your data

Definitely store your data in Hadoop. Spend some time thinking about how you will organize this from a file system perspective. http://hortonworks.com/apache/hdfs/

Sqoop is a fast and effective way to pull your data from relational databases into Hadoop. http://hortonworks.com/apache/sqoop/

Clean your data

You may need to clean or transform the data after it has landed in Hadoop, e.g. trimming leading and trailing whitespace or removing non-ASCII characters. Pig scripts can do this quickly and effectively. If you do have to clean the data, keep the raw data in one zone (HDFS directory) and write the cleaned data to a destination zone. http://hortonworks.com/apache/pig/

Analyze and visualize your data

You most likely will want to use Spark to do your predictive analysis. Spark is deployed with HDP. It is an in-memory processing engine with libraries to easily perform SQL and machine learning/predictive analysis against your data. Being in-memory, analysis of GBs of data is very rapid. These libraries are accessed with Java, Scala, or Python APIs. (There are also streaming and graph capabilities, but it looks like you will not need these for your analysis.) https://hortonworks.com/apache/spark/

Zeppelin is an awesome UI for performing Spark analyses. It is a notebook-style UI -- browser based and composed of separate "paragraphs", which are areas to perform separate steps of your analysis. Each paragraph is loaded with an interpreter. These interpreters allow you to write shell commands directly against the Linux box hosting the Zeppelin server, or to perform your predictive analysis using Spark's SQL and machine learning/predictive libraries. Zeppelin also has easy-to-use visualization capabilities. https://hortonworks.com/apache/zeppelin/

You may want to use Hive to perform complex SQL against your data. Hive is a SQL engine on Hadoop that is very effective at analyzing huge volumes of both structured and unstructured data. (Spark can reach limits on huge data sizes.) For example, you can analyze tweets where fields in the Hive table are JSON strings, or you can do complex joins across multiple tables. Hive is not as fast as Spark, but it is solid against any volume of data and complexity of query. Having said that, Hive performance has increased greatly in the past few years, largely through the Tez engine, the ORC file format, and in-memory LLAP. You can build Hive tables from Spark and analyze from both, or you can build Hive tables through Hive and also analyze them in Spark. http://hortonworks.com/apache/hive/

General

As mentioned, all of the above tools come out of the box with HDP (current version is 2.5). You can run your analysis from either a browser-based UI (Zeppelin, Ambari views) or from the command line on a server in the cluster (you may want to set up a specialized "edge node" to perform analysis from the command line).

Your Approach

It sounds like you are about to launch a very large project. Be sure to start small by working with small samples of your data to learn the technology and to understand how best to design the way you store and analyze the data. You can get a quick start by downloading the sandbox and following tutorials. http://hortonworks.com/products/sandbox/?gclid=CjwKEAjwj92_BRDQ-NuC98SZkWYSJACWmjhlzsGZqc3fexfPwVWKFOOLOUf__SAlbb1JVpafHxq5bxoC3-Hw_wcB
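To make the Spark + Zeppelin piece concrete, here is a tiny sketch of the kind of analysis paragraph described above. It assumes Spark 2.x, and the HDFS path, table name, and columns are made up purely for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SalesExploration").getOrCreate()

# Load the cleaned data from its HDFS zone into a DataFrame
sales = spark.read.csv("hdfs:///data/clean/sales.csv",
                       header=True, inferSchema=True)

# Expose it to SQL, then aggregate -- in Zeppelin the result can be
# rendered as a table or chart with the built-in visualizations
sales.createOrReplaceTempView("sales")
spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC
""").show()
```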
						
					
		
10-05-2016 04:23 PM
3 Kudos

@Jonas Carson Using custom property files should solve your needs perfectly. The bottom line is that you configure your processor with NiFi Expression Language that references the custom property, e.g. ${my.cust.prop.name}. Each environment has its own instance of the custom property file, with the same property names as the files deployed to the other environments but values specific to that environment.

To implement this, open the nifi.properties file and set the field nifi.variable.registry.properties to a comma-delimited list of paths to custom property files. Be sure to make your property names unique if you are using more than one property file in the same environment. Also, they must be unique from system and environment properties.

See the following links for more information:

https://community.hortonworks.com/articles/57304/supporting-custom-properties-for-expression-langua.html

https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Using_Custom_Properties

Note: You can also refer to system variables and environment variables the same way: ${system.variable.name}
						
					
		
10-05-2016 12:49 PM
1 Kudo

Use --outputformat=dsv

The delimiter for this output format is configurable, but the default is a pipe, so the above should be sufficient for your needs. If you want to use something else as the delimiter, add --delimiterForDSV=DELIMITER

For more details, see: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Separated-ValueOutputFormats

(If this is what you are looking for, let me know by accepting the answer. Else, let me know the gaps in the answer.)
						
					
		
10-05-2016 12:11 PM

Command Line

If these three files are in the same directory, run the following from the command line of a server in the cluster. It will merge the files into one file and store it locally:

hdfs dfs -getmerge <hdfsDir> <localFile>

where <hdfsDir> is the directory holding the files on HDFS and <localFile> is the name of the merged file that will be stored locally.

If you are talking about a directory structure that looks like this in HDFS:

myFile.txt/_SUCCESS
myFile.txt/part-m-00000
myFile.txt/part-m-00001

this is the result of a map-reduce job. <hdfsDir> in this case would be myFile.txt. Note that _SUCCESS is a 0-byte file: it has no contents -- it is just a flag to designate that the map-reduce job was a success.

Ambari

Alternatively, you can do this from the File View in Ambari. Open the directory holding the files you want to merge into one, check the files you want to merge, and then click Concatenate from the far-right dropdown. This will download the merged (concatenated) file through your browser.

Note for both approaches: the above works for multiple files in the same directory even if the files are not the result of a map-reduce job (but it is typically used for map-reduce results).

(If this is what you were looking for, please let me know by accepting the answer. Else, let me know the gaps in the answer.)
						
					
		
10-04-2016 11:24 AM
1 Kudo

If you do not have to worry about partitions, it is as you state:

INSERT OVERWRITE TABLE old_data SELECT <statement> FROM new_data;

If you have a partition, you must specify it:

INSERT OVERWRITE TABLE old_data PARTITION (id = <value>) SELECT <statement filtering by id> FROM new_data;

Note that the SELECT statement has to select the same columns, in the same order, as those you are inserting into.

See the following for more color:

https://community.hortonworks.com/questions/28683/overwriting-a-column-in-hive.html

https://community.hortonworks.com/questions/5579/insert-overwrite-of-2-gb-data.html

https://community.hortonworks.com/questions/49967/insert-overwrite-running-too-slow-when-inserting-d.html
						
					
		
10-04-2016 03:05 AM

@Seyma Menjour Glad to hear you are flying through the technology stack with such ease 🙂 BTW, one little trick with Zeppelin is that you can hide either the command or the output. It is a small touch, but hiding the Zeppelin commands after you run them can make story-telling to non-tech folks more direct -- you only see the visualizations.
						
					
		
10-04-2016 02:23 AM
1 Kudo

That tutorial has been completely replaced with visualization by Zeppelin. If you really want to try to use it again, you can find a copy here, but all the links to images are broken: https://github.com/hortonworks/tutorials/blob/f5f97f40807157891c2c9c85e279182d44fdc1ee/tutorials/hortonworks/hello-hdp-an-introduction-to-hadoop/hello-hdp-section-9.md

This is taken from an old GitHub repo: https://github.com/hortonworks/tutorials/tree/f5f97f40807157891c2c9c85e279182d44fdc1ee/tutorials/hortonworks/hello-hdp-an-introduction-to-hadoop

That is the best you will find with this old tutorial. The bottom line, though, is that Zeppelin is an exciting new analytics and visualization tool that is worth investing your time in learning. It is the current and future direction of Big Data analytics and visualization for most data discovery, exploration, and story-telling activities. Check it out: http://hortonworks.com/apache/zeppelin/

(If this answer is what you are looking for, please let me know by accepting the answer.)
						
					