Member since: 06-18-2016
Posts: 52
Kudos Received: 14
Solutions: 0
01-11-2021
09:35 PM
Later versions of Hive have a "sys" DB that under the hood connects back to the Hive metastore database (e.g. Postgres or whatever), and you can query that. Impala does not seem to be able to see this sys DB, though. There is also an "information_schema" DB with a smaller and cleaner subset, but it points back to sys and is also not visible from Impala if you do a "show databases;". You can use "show" statements in impala-shell, but I'm not sure there is a DB to throw SQL at via ODBC/JDBC. Still looking for a way to do this in Impala.
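For anyone trying the HiveServer2 route in the meantime, throwing SQL at sys over JDBC does work on the Hive side; a rough Scala sketch (host, port and credentials below are placeholders, not from the original post):
import java.sql.DriverManager
Class.forName("org.apache.hive.jdbc.HiveDriver")
// placeholder connection string; point it at your HiveServer2 instance
val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/sys", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("select tbl_name, tbl_type from sys.tbls limit 10")
while (rs.next()) {
  println(rs.getString(1) + "\t" + rs.getString(2))   // table name and type from the metastore-backed sys DB
}
conn.close()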
10-10-2016
09:37 PM
Do you mean putting the file using hadoop fs -put? If so, then 'no', it takes advantage of the FileSystem API to write to HDFS.
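For what it's worth, the same FileSystem API that -put goes through can be called directly from code; a minimal Scala sketch (the paths here are made up for illustration):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val conf = new Configuration()          // picks up core-site.xml / hdfs-site.xml from the classpath
val fs = FileSystem.get(conf)
// copy a local file into HDFS, same effect as 'hadoop fs -put'
fs.copyFromLocalFile(new Path("/tmp/local_file.txt"), new Path("/user/someuser/local_file.txt"))
fs.close()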
09-29-2016
05:07 PM
Yes. That's sort of what I had in mind, but it'll still depend on how balanced/imbalanced your data is. There are algorithms for doing this more intelligently too but I've never looked at how to do them in Spark. It looks like the FPGrowth() classes expose a support proportion, but I can't quite tell what it does if you have, e.g., 10k 1's and 100 items with count > 1. I probably can't take you much further without doing some reading.
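For reference, the support knob sits on FPGrowth itself; a minimal Scala sketch of how it is set (the toy transactions and the 0.2 threshold are just placeholders):
import org.apache.spark.mllib.fpm.FPGrowth
val transactions = sc.parallelize(Seq(
  Array("a", "b"),
  Array("a", "c"),
  Array("a", "b", "c"),
  Array("b", "c")
))
val model = new FPGrowth()
  .setMinSupport(0.2)      // keep itemsets appearing in at least 20% of transactions
  .setNumPartitions(4)
  .run(transactions)
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", freq=" + itemset.freq)
}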
01-03-2017
12:19 PM
Hi @Greg Keys, could you please provide input on my clarification?
09-09-2016
09:48 PM
I might try writing a UDF with custom counters; it sounds like an interesting challenge.
09-06-2016
10:16 PM
1 Kudo
@Pedro Rodgers you can get your multiple files into a Spark RDD with:
val data = sc.textFile("/user/pedro/pig_files/*txt")
or even:
val data = sc.textFile("/user/pedro/pig_files")
From this point onwards the Spark RDD 'data' will have as many partitions as there are pig files. Spark is just as happy with that, since distributing the data speeds up anything you want to do on that RDD. Now if you want to merge those files into one and rewrite them to HDFS, it is just:
data.repartition(1).saveAsTextFile("/user/pedro/new_file_dir")
You cannot (easily) determine the name of the output file, only the HDFS directory it is written to. Hope this helps.
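A small variation (not from the reply above): if the only goal is a single output file, coalesce(1) instead of repartition(1) does the same job and avoids a full shuffle:
data.coalesce(1).saveAsTextFile("/user/pedro/new_file_dir")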
09-06-2016
09:46 AM
scala> val a = sc.textFile("/user/.../path/to/your/file").map(x => x.split("\t")).filter(x => x(0) != x(1))
scala> a.take(4)
res2: Array[Array[String]] = Array(Array(1, 4), Array(2, 5), Array(1, 5))
Try the snippet above; just insert the path to your file on HDFS.
09-05-2016
02:11 PM
Then I'd try the following: vertices.map(_.split(" ")).saveAsTextFile("my/hdfs/path/directory")
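One caveat worth flagging: saveAsTextFile writes each element's toString, and for an RDD[Array[String]] that gives output like "[Ljava.lang.String;@1a2b3c" rather than the tokens themselves. If the intent is one readable line per record (an assumption on my part), joining the tokens back before saving may be closer to what you want:
vertices.map(_.split(" ").mkString("\t")).saveAsTextFile("my/hdfs/path/directory")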
08-25-2016
01:13 PM
B) Create a Hive table. The Hive table should have all the columns stated in your hive2parquet.csv file. Assume (col1, col2, col3). Also assume your csv file is in the /tmp dir inside HDFS.
1- Log into Hive and, at the hive command prompt, execute 2-, 3-, and C) below.
// create the hive table
2- create table temp_txt (col1 string, col2 string, col3 string) row format delimited fields terminated by ',';
// load the hive table with the hive2parquet.csv file
3- load data inpath '/tmp/hive2parquet.csv' into table temp_txt;
// insert from table 'temp_txt' into table 'table_parquet_file'
C- insert into table table_parquet_file select * from temp_txt;
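The parquet target table from step A is not shown above; in case it is needed, one way to create it is sketched here from a Spark shell with a HiveContext (the schema is assumed to mirror the csv staging table, plain Hive DDL would work just as well):
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
// assumed schema: the same three string columns as temp_txt
hc.sql("create table if not exists table_parquet_file (col1 string, col2 string, col3 string) stored as parquet")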
08-22-2016
11:55 AM
1 Kudo
Hi Pedro, a Python API for Spark GraphX is still missing; however, there is a git project with a higher-level API on top of Spark GraphX called GraphFrames (GraphFrames). The project claims: "GraphX is to RDDs as GraphFrames are to DataFrames." I haven't worked with it, but a quick test of their samples with Spark 1.6.2 worked. Use pyspark like this:
pyspark --packages graphframes:graphframes:0.2.0-spark1.6-s_2.10
or use Zeppelin and add the dependency to the interpreter configuration. Maybe this library has what you need.
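In case it helps, the GraphFrames quick-start boils down to building a graph from two DataFrames; a minimal sketch on the Scala side (the Python API mirrors it closely), with made-up sample data:
import org.graphframes.GraphFrame
val v = sqlContext.createDataFrame(Seq(("a", "Alice"), ("b", "Bob"), ("c", "Charlie"))).toDF("id", "name")
val e = sqlContext.createDataFrame(Seq(("a", "b", "friend"), ("b", "c", "follow"))).toDF("src", "dst", "relationship")
val g = GraphFrame(v, e)          // vertices need an 'id' column, edges need 'src' and 'dst'
g.inDegrees.show()
g.edges.filter("relationship = 'follow'").count()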