
How to concatenate two files using Python

input1:
create external table db.emp(id int, name string) row format delimited fields terminated by ',' location 'hdfs:.../';
create external table db.emp1(id int, name string) row format delimited fields terminated by ',' location 'hdfs:.../';

input2:
create external table db.emp(id int, name string) row format delimited fields terminated by ',' location 'hdfs:.../';
create table db.emp1(id int, name string) row format delimited fields terminated by ',' location 'hdfs:.../';

required output:
create external table db.emp(id int, name string) row format delimited fields terminated by ',' location 'hdfs:.../';
create external table db.emp(id int, name string) row format delimited fields terminated by ',' location 'hdfs:.../';
(these two statements stored under file1.hql)
******
create external table db.emp1(id int, name string) row format delimited fields terminated by ',' location 'hdfs:.../';
create external table db.emp(id int, name string) row format delimited fields terminated by ',' location 'hdfs:.../';
(these two statements stored under file2.hql)

and so on

I tried:

filenames = ['in1.txt', 'in2.txt']
with open('result.txt', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            content = infile.read().replace('\n', '')
            outfile.write(content)

but I didn't get the required output. Please give me a hint on how to achieve this.
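One thing to note about the attempt above: the `.replace('\n', '')` call strips every line break, so all the statements run together in the output. Writing each file's content as-is (and adding a newline between files) keeps the statement boundaries intact. A minimal, self-contained sketch, using hypothetical file names and sample contents:

```python
# Two tiny sample input files (hypothetical names and contents),
# standing in for the real .hql inputs.
samples = {
    'in1.txt': "create external table db.emp(id int, name string);\n",
    'in2.txt': "create external table db.emp1(id int, name string);\n",
}
for name, text in samples.items():
    with open(name, 'w') as f:
        f.write(text)

# Concatenate the files, keeping line breaks instead of stripping them.
with open('result.txt', 'w') as outfile:
    for fname in ['in1.txt', 'in2.txt']:
        with open(fname) as infile:
            outfile.write(infile.read())

print(open('result.txt').read())
```

This only covers straight concatenation; grouping statements by table into separate .hql outputs would additionally need a pass that parses each statement and routes it by table name.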

1 Reply

I'll approach this a couple of ways. First, I loaded two files into HDFS.

[maria_dev@sandbox 103650]$ hdfs dfs -ls /user/maria_dev/hcc/103650/
Found 2 items
-rw-r--r--   1 maria_dev hdfs         12 2017-05-21 21:21 /user/maria_dev/hcc/103650/file1.tsv
-rw-r--r--   1 maria_dev hdfs         15 2017-05-21 21:21 /user/maria_dev/hcc/103650/file2.tsv
[maria_dev@sandbox 103650]$ hdfs dfs -cat hcc/103650/file1.tsv
1	A
2	B
3	C
[maria_dev@sandbox 103650]$ hdfs dfs -cat hcc/103650/file2.tsv
24	X
25	Y
26	Z
[maria_dev@sandbox 103650]$ 

I then ran the following DDL and LOAD statements to sweep those files into the tables.

CREATE TABLE table1 (
  id INT,
  letter STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

CREATE TABLE table2 (
  id INT,
  letter STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/maria_dev/hcc/103650/file1.tsv' 
  OVERWRITE INTO TABLE table1;

LOAD DATA INPATH '/user/maria_dev/hcc/103650/file2.tsv'
  OVERWRITE INTO TABLE table2; 

Here's the output from the SELECT statements as well as cat'ing the moved files.

hive> select * from table1;
OK
1	A
2	B
3	C
Time taken: 1.738 seconds, Fetched: 3 row(s)
hive> select * from table2;
OK
24	X
25	Y
26	Z
Time taken: 0.186 seconds, Fetched: 3 row(s)
hive> 

[maria_dev@sandbox 103650]$ hdfs dfs -cat /apps/hive/warehouse/table1/file1.tsv
1	A
2	B
3	C
[maria_dev@sandbox 103650]$ hdfs dfs -cat /apps/hive/warehouse/table2/file2.tsv
24	X
25	Y
26	Z
[maria_dev@sandbox 103650]$ 

Now... that should get me caught up to your problem, which, as I understand it, is simply that you need to read two (or more) files into a single Spark dataset.

Here's how to do that with the RDD API; sc.textFile() accepts a comma-separated list of paths.

>>> rddTabs = sc.textFile("/apps/hive/warehouse/table1/file1.tsv,/apps/hive/warehouse/table2/file2.tsv")
>>> rddTabs.count()
6
>>> rddTabs.take(6)
[u'1\tA', u'2\tB', u'3\tC', u'24\tX', u'25\tY', u'26\tZ']
>>> 
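For comparison, the same "read several files as one stream of lines" idea can be sketched in plain Python with itertools.chain (hypothetical file names, mirroring the sample data above):

```python
from itertools import chain

# Hypothetical sample files mirroring file1.tsv / file2.tsv above.
for name, rows in [('t1.tsv', ['1\tA', '2\tB', '3\tC']),
                   ('t2.tsv', ['24\tX', '25\tY', '26\tZ'])]:
    with open(name, 'w') as f:
        f.write('\n'.join(rows) + '\n')

# Read both files as one stream of lines,
# analogous to sc.textFile("t1.tsv,t2.tsv").
with open('t1.tsv') as f1, open('t2.tsv') as f2:
    lines = [line.rstrip('\n') for line in chain(f1, f2)]

print(lines)  # six rows drawn from both files, in order
```

Unlike the Spark version, this reads everything on one machine, but the shape of the result is the same.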

You could also union two separate RDDs into one as shown below.

>>> rddTab1 = sc.textFile("/apps/hive/warehouse/table1/file1.tsv")
>>> rddTab1.take(5)
[u'1\tA', u'2\tB', u'3\tC']
>>> rddTab2 = sc.textFile("/apps/hive/warehouse/table2/file2.tsv")
>>> rddTab2.take(5)
[u'24\tX', u'25\tY', u'26\tZ']
>>> rddCombined = rddTab1.union(rddTab2)
>>> rddCombined.count()
6
>>> rddCombined.take(6)
[u'1\tA', u'2\tB', u'3\tC', u'24\tX', u'25\tY', u'26\tZ']
>>> 

All that said, I think you were imagining doing this with the DataFrame API. Since these are already Hive tables, you could just do a SQL union out of the box. Note that UNION (unlike UNION ALL) deduplicates rows and does not preserve input order, which is why the result below comes back shuffled; use UNION ALL if you want to keep every row.

>>> dfTabs = sqlContext.sql("SELECT * FROM table1 UNION SELECT * FROM table2")
>>> dfTabs.show()
+---+------+
| id|letter|
+---+------+
|  2|     B|
| 24|     X|
|  3|     C|
| 25|     Y|
| 26|     Z|
|  1|     A|
+---+------+

I hope this was helpful and addresses your question. Good luck and Happy Hadooping/Sparking!
