Member since: 06-18-2016
Posts: 52
Kudos Received: 14
Solutions: 0
01-11-2021
09:35 PM
Later versions of Hive have a "sys" DB that under the hood connects back to the Hive metastore database (e.g. Postgres or whatever), and you can query that. Impala does not seem to be able to see this sys DB, though. There is also an "information_schema" DB with a smaller and cleaner subset, but it points back to sys and is also not visible from Impala if you do a "show databases;". You can use "show" statements in impala-shell, but I'm not sure there is a DB to throw SQL at via ODBC/JDBC. Still looking for a way to do this in Impala.
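For anyone trying the HiveServer2 route in the meantime, throwing SQL at sys over JDBC does work on the Hive side; a rough Scala sketch (host, port and credentials below are placeholders, not from the original post):
import java.sql.DriverManager
Class.forName("org.apache.hive.jdbc.HiveDriver")
// placeholder connection string; point it at your HiveServer2 instance
val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/sys", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("select tbl_name, tbl_type from sys.tbls limit 10")
while (rs.next()) {
  println(rs.getString(1) + "\t" + rs.getString(2))   // table name and type from the metastore-backed sys DB
}
conn.close()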
10-10-2016
09:37 PM
Do you mean putting the file using hadoop fs -put? If so, then 'no', it takes advantage of the FileSystem API to write to HDFS.
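For what it's worth, the same FileSystem API that -put goes through can be called directly from code; a minimal Scala sketch (the paths here are made up for illustration):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val conf = new Configuration()          // picks up core-site.xml / hdfs-site.xml from the classpath
val fs = FileSystem.get(conf)
// copy a local file into HDFS, same effect as 'hadoop fs -put'
fs.copyFromLocalFile(new Path("/tmp/local_file.txt"), new Path("/user/someuser/local_file.txt"))
fs.close()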
09-29-2016
05:07 PM
Yes. That's sort of what I had in mind, but it'll still depend on how balanced/imbalanced your data is. There are algorithms for doing this more intelligently too but I've never looked at how to do them in Spark. It looks like the FPGrowth() classes expose a support proportion, but I can't quite tell what it does if you have, e.g., 10k 1's and 100 items with count > 1. I probably can't take you much further without doing some reading.
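For reference, the support knob sits on FPGrowth itself; a minimal Scala sketch of how it is set (the toy transactions and the 0.2 threshold are just placeholders):
import org.apache.spark.mllib.fpm.FPGrowth
val transactions = sc.parallelize(Seq(
  Array("a", "b"),
  Array("a", "c"),
  Array("a", "b", "c"),
  Array("b", "c")
))
val model = new FPGrowth()
  .setMinSupport(0.2)      // keep itemsets appearing in at least 20% of transactions
  .setNumPartitions(4)
  .run(transactions)
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", freq=" + itemset.freq)
}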
01-03-2017
12:19 PM
Hi @Greg Keys, could you please provide input on my clarification?
09-09-2016
09:48 PM
I might try writing a UDF with custom counters; it sounds like an interesting challenge.
09-06-2016
10:16 PM
1 Kudo
@Pedro Rodgers you can get your multiple files into a Spark RDD with:
val data = sc.textFile("/user/pedro/pig_files/*txt")
or even:
val data = sc.textFile("/user/pedro/pig_files")
From this point onwards the Spark RDD 'data' will have as many partitions as there are pig files. Spark is just as happy with that, since distributing the data speeds up anything you want to do on that RDD. Now if you want to merge those files into one and rewrite them to HDFS, it is just:
data.repartition(1).saveAsTextFile("/user/pedro/new_file_dir")
You cannot (easily) determine the name of the output file, only the HDFS directory it is written to. Hope this helps.
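A small variation (not from the reply above): if the only goal is a single output file, coalesce(1) instead of repartition(1) does the same job and avoids a full shuffle:
data.coalesce(1).saveAsTextFile("/user/pedro/new_file_dir")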
09-06-2016
09:46 AM
scala> val a = sc.textFile("/user/.../path/to/your/file").map(x => x.split("\t")).filter(x => x(0) != x(1))
scala> a.take(4)
res2: Array[Array[String]] = Array(Array(1, 4), Array(2, 5), Array(1, 5))
Try the snippet above; just insert the path to your file on HDFS.
09-05-2016
02:11 PM
Then I'd try the following: vertices.map(_.split(" ")).saveAsTextFile("my/hdfs/path/directory")
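One caveat worth flagging: saveAsTextFile writes each element's toString, and for an RDD[Array[String]] that gives output like "[Ljava.lang.String;@1a2b3c" rather than the tokens themselves. If the intent is one readable line per record (an assumption on my part), joining the tokens back before saving may be closer to what you want:
vertices.map(_.split(" ").mkString("\t")).saveAsTextFile("my/hdfs/path/directory")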
08-25-2016
01:13 PM
B) Create a Hive table. The Hive table should have all the columns stated in your hive2parquet.csv file. Assume (col1, col2, col3). Also assume your csv file is in the /tmp dir inside HDFS.
1- Log into Hive and, at the hive command prompt, execute 2-, 3-, and C) below.
// create the hive table
2- create table temp_txt (col1 string, col2 string, col3 string) row format delimited fields terminated by ',';
// load the hive table with the hive2parquet.csv file
3- load data inpath '/tmp/hive2parquet.csv' into table temp_txt;
// insert from table 'temp_txt' into table 'table_parquet_file'
C- insert into table table_parquet_file select * from temp_txt;
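The parquet target table from step A is not shown above; in case it is needed, one way to create it is sketched here from a Spark shell with a HiveContext (the schema is assumed to mirror the csv staging table, plain Hive DDL would work just as well):
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
// assumed schema: the same three string columns as temp_txt
hc.sql("create table if not exists table_parquet_file (col1 string, col2 string, col3 string) stored as parquet")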
08-22-2016
11:55 AM
1 Kudo
Hi Pedro, a Python API for Spark GraphX is still missing; however, there is a git project with a higher-level API on top of Spark GraphX called GraphFrames (GraphFrames). The project claims: "GraphX is to RDDs as GraphFrames are to DataFrames." I haven't worked with it, but a quick test of their samples with Spark 1.6.2 worked. Use pyspark like this:
pyspark --packages graphframes:graphframes:0.2.0-spark1.6-s_2.10
or use Zeppelin and add the dependency to the interpreter configuration. Maybe this library has what you need.
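In case it helps, the GraphFrames quick-start boils down to building a graph from two DataFrames; a minimal sketch on the Scala side (the Python API mirrors it closely), with made-up sample data:
import org.graphframes.GraphFrame
val v = sqlContext.createDataFrame(Seq(("a", "Alice"), ("b", "Bob"), ("c", "Charlie"))).toDF("id", "name")
val e = sqlContext.createDataFrame(Seq(("a", "b", "friend"), ("b", "c", "follow"))).toDF("src", "dst", "relationship")
val g = GraphFrame(v, e)          // vertices need an 'id' column, edges need 'src' and 'dst'
g.inDegrees.show()
g.edges.filter("relationship = 'follow'").count()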