Created 05-19-2017 05:29 AM
input1:

create external table db.emp(id int,name string) row formatted fields terminated by ',' location 'hadfs:.../';
create external table db.emp1(id int,name string) row formatted fields terminated by ',' location 'hadfs:.../';

input2:

create external table db.emp(id int,name string) row formatted fields terminated by ',' location 'hadfs:.../';
create table db.emp1(id int,name string) row formatted fields terminated by ',' location 'hadfs:.../';

required output:

create external table db.emp(id int,name string) row formatted fields terminated by ',' location 'hadfs:.../';
create external table db.emp(id int,name string) row formatted fields terminated by ',' location 'hadfs:.../';

these two statements go into file1.hql

******

create external table db.emp1(id int,name string) row formatted fields terminated by ',' location 'hadfs:.../';
create external table db.emp(id int,name string) row formatted fields terminated by ',' location 'hadfs:.../';

these two statements go into file2.hql
and so on
I tried with:

filenames = ['in1.txt', 'in2.txt']
with open('result.txt', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            content = infile.read().replace('\n', '')
            outfile.write(content)
But I didn't get the required output. Please give me a hint on how to achieve this.
Created 05-21-2017 09:46 PM
I'll try this a couple of ways. First, I loaded two files into HDFS.
[maria_dev@sandbox 103650]$ hdfs dfs -ls /user/maria_dev/hcc/103650/
Found 2 items
-rw-r--r--   1 maria_dev hdfs   12 2017-05-21 21:21 /user/maria_dev/hcc/103650/file1.tsv
-rw-r--r--   1 maria_dev hdfs   15 2017-05-21 21:21 /user/maria_dev/hcc/103650/file2.tsv
[maria_dev@sandbox 103650]$ hdfs dfs -cat hcc/103650/file1.tsv
1	A
2	B
3	C
[maria_dev@sandbox 103650]$ hdfs dfs -cat hcc/103650/file2.tsv
24	X
25	Y
26	Z
[maria_dev@sandbox 103650]$
I then ran the following DDL and the appropriate LOAD statements to sweep those files into the tables.
CREATE TABLE table1 (
  id INT,
  letter STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

CREATE TABLE table2 (
  id INT,
  letter STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/maria_dev/hcc/103650/file1.tsv' OVERWRITE INTO TABLE table1;
LOAD DATA INPATH '/user/maria_dev/hcc/103650/file2.tsv' OVERWRITE INTO TABLE table2;
Here's the output from the SELECT statements as well as cat'ing the moved files.
hive> select * from table1;
OK
1	A
2	B
3	C
Time taken: 1.738 seconds, Fetched: 3 row(s)
hive> select * from table2;
OK
24	X
25	Y
26	Z
Time taken: 0.186 seconds, Fetched: 3 row(s)
hive>

[maria_dev@sandbox 103650]$ hdfs dfs -cat /apps/hive/warehouse/table1/file1.tsv
1	A
2	B
3	C
[maria_dev@sandbox 103650]$ hdfs dfs -cat /apps/hive/warehouse/table2/file2.tsv
24	X
25	Y
26	Z
[maria_dev@sandbox 103650]$
Now... that should get me caught up to your problem, which I understand is simply that you need to read two (or more) files into a single Spark dataset.

Here is how to do that with the RDD API.
>>> rddTabs = sc.textFile("/apps/hive/warehouse/table1/file1.tsv,/apps/hive/warehouse/table2/file2.tsv")
>>> rddTabs.count()
6
>>> rddTabs.take(6)
[u'1\tA', u'2\tB', u'3\tC', u'24\tX', u'25\tY', u'26\tZ']
>>>
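As a side note, sc.textFile also accepts directories and Hadoop-style glob patterns in addition to the comma-separated list above, so something like the following should pick up both files at once. This is just a minimal sketch assuming the same warehouse paths as above; rddGlob is an illustrative name.

>>> rddGlob = sc.textFile("/apps/hive/warehouse/table[12]/*.tsv")  # glob across both table directories
>>> rddGlob.count()  # should come back as 6 for the sample data above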
You could also union two separate RDDs into one as shown below.
>>> rddTab1 = sc.textFile("/apps/hive/warehouse/table1/file1.tsv")
>>> rddTab1.take(5)
[u'1\tA', u'2\tB', u'3\tC']
>>> rddTab2 = sc.textFile("/apps/hive/warehouse/table2/file2.tsv")
>>> rddTab2.take(5)
[u'24\tX', u'25\tY', u'26\tZ']
>>> rddCombined = rddTab1.union(rddTab2)
>>> rddCombined.count()
6
>>> rddCombined.take(6)
[u'1\tA', u'2\tB', u'3\tC', u'24\tX', u'25\tY', u'26\tZ']
>>>
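If you ever have more than two files, SparkContext.union can combine a whole list of RDDs in one call instead of chaining union repeatedly. A minimal sketch reusing the same two paths from above (the list and variable names are just illustrative):

>>> paths = ["/apps/hive/warehouse/table1/file1.tsv", "/apps/hive/warehouse/table2/file2.tsv"]
>>> rddAll = sc.union([sc.textFile(p) for p in paths])  # one RDD with every line from every file
>>> rddAll.count()  # 6 for the sample data above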
All that said, I think you were imagining doing this with the DataFrame API. Since they are already Hive tables, you could just do a SQL union out of the box.
>>> dfTabs = sqlContext.sql("SELECT * FROM table1 UNION SELECT * FROM table2")
>>> dfTabs.show()
+---+------+
| id|letter|
+---+------+
|  2|     B|
| 24|     X|
|  3|     C|
| 25|     Y|
| 26|     Z|
|  1|     A|
+---+------+
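If you would rather stay in the DataFrame API instead of writing SQL, the same union can be done with DataFrame methods. A minimal sketch against the two tables above, assuming the Spark 1.x API on the sandbox (in later Spark versions unionAll is replaced by union):

>>> dfTab1 = sqlContext.table("table1")
>>> dfTab2 = sqlContext.table("table2")
>>> dfCombined = dfTab1.unionAll(dfTab2)  # keeps every row from both tables
>>> dfCombined.count()  # 6 for the sample data above

One thing to note: the SQL UNION above removes duplicate rows, while unionAll (or UNION ALL in SQL) keeps them, so choose whichever behavior you actually want.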
I hope this was helpful and addresses your question. Good luck and Happy Hadooping/Sparking!