Member since: 12-08-2015
Posts: 24
Kudos Received: 22
Solutions: 6

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3473 | 06-17-2016 01:24 PM |
 | 2094 | 06-17-2016 12:29 AM |
 | 11388 | 06-14-2016 07:59 PM |
 | 7662 | 03-21-2016 10:33 PM |
 | 1912 | 01-12-2016 07:09 PM |
06-17-2016
01:24 PM
@khushi kalra Take a look at these tutorials; they should get you forward a few more steps. First, this R/JDBC tutorial (or @Sindhu's post above) can get you through making a database connection: http://henning.kropponline.de/2014/07/13/hive-r/ From that link, you can see a couple of lines where the author pulls data from a table and does a simple plot:

sample_08 <- dbReadTable(conn, "sample_08")
plot(sample_08$sample_08.salary)

You'll probably want to do more sophisticated SQL and plots, though. The documentation for RJDBC can be found here: https://cran.r-project.org/web/packages/RJDBC/index.html To run an arbitrary query, you use the dbSendQuery() and dbFetch() functions, as in this tutorial: http://www.inside-r.org/packages/cran/DBI/docs/dbGetQuery

res <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = 4;")
data <- dbFetch(res)

Now 'data' holds the results you can plot. For more sophisticated plots in R, the typical approach is to use the 'ggplot2' library; there are lots of tutorials out there. The connection to what you've done with RJDBC is that the 'data' object above is a data frame you can use in building your charts. Here's one ggplot2 tutorial: http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html For a quick look in base R, something like this works too:

hist(data$some.value)
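If you go the ggplot2 route, a minimal sketch might look like this (assuming 'data' is the data frame fetched above, so the mpg and hp columns come from the mtcars example):

library(ggplot2)
# scatter plot of fuel economy vs. horsepower for the rows fetched above
ggplot(data, aes(x = hp, y = mpg)) + geom_point()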
06-17-2016
12:29 AM
According to the NiFi documentation, the time zone is embedded in the date information: https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#types What if you included the time zone in your input and parsed it with 'Z' in the format string?
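A rough sketch of what that might look like with the Expression Language's toDate() function (the attribute name 'entryDate' and the exact input format are placeholders for whatever you actually have):

${entryDate:toDate("yyyy-MM-dd HH:mm:ss Z")}

For example, an input value like "2016-06-17 12:29:00 -0500" would parse with that pattern, since 'Z' matches the RFC 822 time zone offset.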
06-16-2016
04:03 PM
1 Kudo
There's a really simple example that uses RODBC to query Hive from R. Should work in RStudio just fine, but you might need to adjust some instructions based on your Hive environment versus the HDInsight example. https://blogs.technet.microsoft.com/meacoex/2014/06/07/connecting-r-to-hdinsight-through-hive/
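The core of that approach looks roughly like this (the DSN name is just a placeholder for whatever ODBC data source you configure for your Hive server):

library(RODBC)
# "HiveDSN" is a hypothetical ODBC data source name pointing at your HiveServer2
conn <- odbcConnect("HiveDSN")
sample_08 <- sqlQuery(conn, "SELECT * FROM sample_08 LIMIT 10")
head(sample_08)
odbcClose(conn)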
06-15-2016
07:35 PM
1 Kudo
@Vijay Parmar I think you'll want to use some kind of outside tool to orchestrate that series of activities, rather than trying to do it all within the Hive environment. HiveQL doesn't have the ability, by itself, to make a series of HTTP calls to external services and retrieve data from them. You could script all of these steps in something like Python and call that script as a Hive UDF, but I would recommend looking at NiFi / HDF to orchestrate the process:

1. Use the QueryDatabaseTable processor to access the Teradata table that you need (via JDBC).
2. Use the EvaluateJsonPath processor to pull out the specific URL attribute in the JSON.
3. Use a GetHTTP / PostHTTP processor to make the HTTP call that retrieves the next JSON document.
4. Use EvaluateJsonPath again to pick out the pieces of that document that you want to write to Hive.
5. Use the PutHDFS processor to write the output into the HDFS location.
6. Then layer an external Hive table on top of that HDFS location (a rough DDL sketch follows below).

You might also use some other processors in the middle to merge content together into a single file or otherwise optimize things for the final output format. How you approach this probably depends on what tools you have on hand, how much data you're going to run through the process, and how often it has to run.
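For that last step, a rough sketch of the external table DDL, assuming the fields you extracted were written out as comma-delimited text (the path and column names here are hypothetical):

CREATE EXTERNAL TABLE enriched_records (
  record_id STRING,
  source_url STRING,
  payload_value STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/enriched_records';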
06-14-2016
07:59 PM
1 Kudo
@Bruce Perez Are you looking for the SQL that will help you achieve the "filling" that you want to do?
Looking at your simplified picture, I created two tables, dates and date_rate, and populated them with your sample data. The query that does the appropriate joins and "fills down" the missing values should look roughly like the one below. There may be other methods, but in this one we build a new view of your date_rate data by treating each record's "end_date" as one day before the next date_rate record's date. Then we can join the dates table onto that range instead.
SELECT
d.*,
dr.*
FROM
dates d LEFT OUTER JOIN
(
SELECT
r1.rate,
r1.dt AS start_date,
MIN(DATE_SUB(r2.dt, 1)) AS end_date
FROM
date_rate r1 LEFT OUTER JOIN
date_rate r2 ON r1.dt < r2.dt
GROUP BY
r1.rate, r1.dt
) dr ON d.dt >= dr.start_date AND (d.dt <= dr.end_date OR dr.end_date IS NULL)
ORDER BY
d.dt
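For reference, a minimal version of those two tables could look like this (the types are a guess from the picture; adjust them to your actual data):

CREATE TABLE dates (dt DATE);
CREATE TABLE date_rate (dt DATE, rate DECIMAL(10,2));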
06-06-2016
04:01 AM
Can you say a little bit more about the text files? Are they all the same kind of data and format, or different? How big are the text files in terms of GB and number of rows/columns?
06-06-2016
03:37 AM
1 Kudo
I'd recommend that you take a serious look at creating a custom NiFi processor. I found it to be a fairly straightforward exercise and wrote a recent post about it here: https://hortonworks.com/blog/apache-nifi-not-scratch/
06-02-2016
03:50 AM
Just double-checking the obvious: Do you have Python installed on your Windows machine? And have you installed the petl and xlrd modules using pip? You can test the script by itself just by running it from the command line. Have you tried that?
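From a command prompt, that test would be something like this (the script and file names here are just placeholders):

pip install petl xlrd
python xls_to_csv.py C:\data\input.xls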
06-01-2016
05:55 AM
2 Kudos
You didn't mention whether you already have a mechanism for converting your XLS files into CSV, but here's a way you might want to orchestrate all of this in NiFi:

1. Use a ListFile processor to list all of the *.xls files in your input directory; it will output a flowfile for each .xls file it finds.
2. Route that to an ExecuteStreamCommand processor that runs a simple program to convert the XLS to CSV. I'd recommend a simple Python script that uses the petl module. Have that script write the output into the same directory.
3. Have a separate NiFi flow that uses a GetFile processor to do whatever it is you want to do with the CSV files. Point it at that same input directory, but make sure you configure the File Filter property to pick up only CSV files. You could use different input directories, too, of course.

The trick, I think, is using ListFile and a conversion script. Here's a simple outline of a Python script that should work in most cases. It assumes that you always want to convert just the first sheet.

#!/usr/bin/env python
import sys
import petl as etl
import xlrd
# Pass ${absolute.path}/${filename} as a command line argument
inputFile = sys.argv[1]
xls = etl.fromxls(inputFile)
etl.tocsv(xls, inputFile+".csv", write_header=True)
03-21-2016
10:33 PM
4 Kudos
You've mentioned Python to implement TF-IDF, but unless you absolutely have to use Python for some other reason, consider implementing the same algorithm in Hive SQL instead. That way, it will run in parallel without any extra work. Take a look at the Wikipedia article on TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Here's one sample SQL implementation of TF-IDF that you could build Hive SQL from by ignoring all the index-related stuff: https://gist.github.com/sumanthprabhu/8067221
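As a rough sketch of the same idea in HiveQL (not taken from that gist; it assumes a hypothetical table doc_words(doc_id, word) with one row per word occurrence, and a Hive version with CTE support):

WITH tf AS (
  SELECT doc_id, word, COUNT(*) AS term_count
  FROM doc_words
  GROUP BY doc_id, word
),
doc_len AS (
  SELECT doc_id, COUNT(*) AS total_terms
  FROM doc_words
  GROUP BY doc_id
),
df AS (
  SELECT word, COUNT(DISTINCT doc_id) AS docs_with_term
  FROM doc_words
  GROUP BY word
),
corpus AS (
  SELECT COUNT(DISTINCT doc_id) AS total_docs
  FROM doc_words
)
SELECT
  tf.doc_id,
  tf.word,
  (tf.term_count / doc_len.total_terms) * LN(corpus.total_docs / df.docs_with_term) AS tf_idf
FROM tf
JOIN doc_len ON tf.doc_id = doc_len.doc_id
JOIN df ON tf.word = df.word
CROSS JOIN corpus;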