Member since: 12-08-2015
Posts: 24
Kudos Received: 22
Solutions: 6

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3587 | 06-17-2016 01:24 PM
 | 2138 | 06-17-2016 12:29 AM
 | 11648 | 06-14-2016 07:59 PM
 | 7818 | 03-21-2016 10:33 PM
 | 1956 | 01-12-2016 07:09 PM
06-17-2016
01:24 PM
@khushi kalra Take a look at this list of tutorials; they should get you forward a few more steps. http://henning.kropponline.de/2014/07/13/hive-r/

First, that R/JDBC tutorial (or @Sindhu's post above) can get you through making a database connection. From the link above, you can see a couple of lines where the author pulls data from a table and does a simple plot:

sample_08 <- dbReadTable(conn, "sample_08")
plot(sample_08$sample_08.salary)

You'll probably want to do more sophisticated SQL and plots, though. The documentation for RJDBC can be found here: https://cran.r-project.org/web/packages/RJDBC/index.html

To run an arbitrary query, you use the dbSendQuery() and dbFetch() commands, as in this tutorial: http://www.inside-r.org/packages/cran/DBI/docs/dbGetQuery

res <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = 4;")
data <- dbFetch(res)

Now 'data' will hold the results you can plot. To do any kind of sophisticated plot in R, the typical thing to do is use the 'ggplot2' library; there are lots of tutorials out there. The connection to what you've done with RJDBC is that the 'data' object above is a data frame you can use in building your charts. Here's one ggplot2 tutorial: http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html

hist(data$some.value)
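If it helps, a slightly more involved query you could send through dbSendQuery() might look like the sketch below. This is HiveQL for illustration only; it assumes the sandbox's sample_08 table from the example above, and the description column name is an assumption on my part.

-- Illustrative only: assumes sample_08 has description and salary columns.
SELECT description, AVG(salary) AS avg_salary
FROM sample_08
GROUP BY description
ORDER BY avg_salary DESC
LIMIT 10;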
06-17-2016
12:29 AM
According to the NiFi documentation, the timezone is embedded in the date information: https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#types What if you included the timezone in your input and parsed it with 'Z' in the format string (for example, a value like "2016-06-17 09:00:00 -0500" parsed with the format "yyyy-MM-dd HH:mm:ss Z")?
06-16-2016
04:03 PM
1 Kudo
There's a really simple example that uses RODBC to query Hive from R. It should work in RStudio just fine, but you might need to adjust some of the instructions for your Hive environment versus the HDInsight example: https://blogs.technet.microsoft.com/meacoex/2014/06/07/connecting-r-to-hdinsight-through-hive/
06-15-2016
07:35 PM
1 Kudo
@Vijay Parmar I think you'll want to use some kind of outside tool to orchestrate that series of activities, rather than trying to do it all within the Hive environment. HiveQL doesn't have the ability, by itself, to make a series of HTTP calls to external services and retrieve data from them. You could take all of these steps and script them in something like Python, and then call that Python script as a Hive UDF, but I would recommend looking at NiFi / HDF to orchestrate the process:

1. Use a QueryDatabaseTable processor to access the Teradata table that you need (via JDBC).
2. Use an EvaluateJSONPath processor to pull out the specific URL attribute in the JSON.
3. Use a GetHTTP/PostHTTP processor to make the HTTP call that fetches the next JSON document.
4. Use an EvaluateJSONPath processor to pick out the pieces of that document that you want to write to Hive.
5. Use a PutHDFS processor to write the output into the HDFS location.
6. Then layer an external Hive table on top of that HDFS location (see the sketch after this list).

You might also use some other processors in the middle to merge content together into a single file or otherwise optimize things for the final output format. How you approach this probably depends on what tools you have on hand, how much data you're going to run through the process, and how often it has to run.
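As a rough illustration of that last step, the external table DDL might look something like the sketch below. The table name, columns, delimiter, and HDFS path are placeholders, not details from your question.

-- Placeholder names and path; adjust to match whatever PutHDFS actually writes.
CREATE EXTERNAL TABLE api_results (
  id STRING,
  url STRING,
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/api_results';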
06-14-2016
07:59 PM
1 Kudo
@Bruce Perez Are you looking for the SQL that will help you achieve the "filling" that you want to do?
Looking at your simplified picture, I created two tables, dates and date_rate, and populated them with your sample data. The query to do the appropriate joins and "fill down" the missing values should look roughly like the one below. There may be other methods, but in this one we derive a new view of the date_rate data by assuming each rate's "end_date" is one day before the date of the next date_rate record. Then we can join the dates table onto that date range instead.
SELECT
    d.*,
    dr.*
FROM
    dates d LEFT OUTER JOIN
    (
        -- Build an effective date range for each rate: end_date is one day before
        -- the next date_rate record's date (NULL for the most recent rate).
        SELECT
            r1.rate,
            r1.dt AS start_date,
            MIN(DATE_SUB(r2.dt, 1)) AS end_date
        FROM
            date_rate r1 LEFT OUTER JOIN
            date_rate r2 ON r1.dt < r2.dt
        GROUP BY
            r1.rate, r1.dt
    ) dr ON d.dt >= dr.start_date AND (d.dt <= dr.end_date OR dr.end_date IS NULL)
ORDER BY
    d.dt
06-06-2016
04:01 AM
Can you say a little bit more about the text files? Are they all the same kind of data and format, or different? How big are the text files in terms of GB and number of rows/columns?
06-06-2016
03:37 AM
1 Kudo
I'd recommend that you take a serious look at creating a custom NiFi processor. I found it to be a fairly straightforward exercise and wrote a recent post about it here: https://hortonworks.com/blog/apache-nifi-not-scratch/
06-02-2016
03:50 AM
Just double-checking the obvious: Do you have Python installed on your Windows machine? And have you installed the petl and xlrd modules using pip? You can test the script by itself just by running it from the command line. Have you tried that?
06-01-2016
05:55 AM
2 Kudos
You didn't mention whether you already have a mechanism for converting your XLS files into CSV, but here's a way you might want to orchestrate all of this in NiFi:

1. Use a ListFile processor to list all of the *.xls files in your input directory; it will output a flowfile for each .xls file it finds.
2. Route that to an ExecuteStreamCommand processor that runs a simple program to convert the XLS to CSV. I'd recommend a simple Python script that uses the petl module. Have that script write its output into the same directory.
3. Have a separate NiFi flow that uses a GetFile processor to do whatever it is you want to do with CSV files. Point it at that same input directory, but make sure you configure the File Filter property to pick up only CSV files. You could use different input directories, too, of course.

The trick, I think, is using ListFile and a conversion script. Here's a simple outline of a Python script that should work in most cases. It assumes that you always want to convert just the first sheet.

#!/usr/bin/env python
import sys
import petl as etl
import xlrd  # petl relies on xlrd to read .xls files

# Pass ${absolute.path}/${filename} as a command line argument
inputFile = sys.argv[1]

# Read the first sheet of the workbook and write it back out next to the input as <name>.xls.csv
xls = etl.fromxls(inputFile)
etl.tocsv(xls, inputFile + ".csv", write_header=True)
03-21-2016
10:33 PM
4 Kudos
You've mentioned Python to implement TF-IDF, but unless you absolutely have to use Python for some other reason, consider implementing the same algorithm in Hive SQL instead. That way, it'll run in parallel without any extra work. Take a look at the Wikipedia article on TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Here's one sample SQL implementation of TF-IDF that you could adapt to Hive SQL by ignoring all the index-related stuff: https://gist.github.com/sumanthprabhu/8067221
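To give a rough idea of the shape of such a query, here is a sketch only — it assumes a hypothetical doc_words table with one row per word occurrence and columns doc_id and word, and it is not the implementation from the gist above.

-- Hypothetical doc_words(doc_id, word) table, one row per word occurrence.
WITH word_counts AS (
  -- Occurrences of each word in each document
  SELECT doc_id, word, COUNT(*) AS n
  FROM doc_words
  GROUP BY doc_id, word
),
doc_totals AS (
  -- Total number of words in each document
  SELECT doc_id, COUNT(*) AS total_words
  FROM doc_words
  GROUP BY doc_id
),
doc_freq AS (
  -- Number of distinct documents containing each word
  SELECT word, COUNT(DISTINCT doc_id) AS df
  FROM doc_words
  GROUP BY word
),
corpus AS (
  SELECT COUNT(DISTINCT doc_id) AS n_docs FROM doc_words
)
SELECT wc.doc_id, wc.word,
       (wc.n / dt.total_words) * LN(c.n_docs / f.df) AS tf_idf
FROM word_counts wc
JOIN doc_totals dt ON wc.doc_id = dt.doc_id
JOIN doc_freq f ON wc.word = f.word
CROSS JOIN corpus c
ORDER BY wc.doc_id, tf_idf DESC;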