Created 11-02-2016 09:00 AM
New to Hive and I followed link mentioned in below .Input will be appreciater
URL: http://hadooptutorial.info/java-vs-hive/
Input:-[docs]
The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed. This approach takes advantage of data locality– nodes manipulating the data they have access to – to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking
Script for Hive in link:- CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
I have following clarifications on script
1)Script was aborted due to Invalid postscript error so i changed the create table statement as mentioned below. Please let me know what i am missing since in the orginal link the author did not face any error?
CREATE TABLE docs (line STRING) STORED AS TEXTFILE;
2)when i got Invalid postscript error file[docs] placed in HDFS home directory got deleted not sure why? Do I need to place the file each and everytime whenever i got Invalid postscript error?
3)The below create statement is creating table with below format.suppose if i want to change the settings to TEXTINPUTFORMAT how to change it?
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat" CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
Created 12-26-2016 06:28 PM
If you look at your statement you will see that group by word and order by word are confused because word is part of the subquery also part of the main query. You need to be explicit on which word you want to order and group by. What was the point to use the alias w, if you are not going to use it?
SELECT w.word, count(1) AS count FROM
(SELECT explode(split(line,'\\s')) AS word FROM docs) w
GROUP BY w.word
ORDER BY w.word;
Created 12-26-2016 05:12 PM
@vamsi valiveti could you please try this?
hive> create table x(yy string); OK Time taken: 2.155 seconds hive> show create table x; OK CREATE TABLE `x`( `yy` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://rkk1.hdp.local:8020/apps/hive/warehouse/x' TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}', 'numFiles'='0', 'numRows'='0', 'rawDataSize'='0', 'totalSize'='0', 'transient_lastDdlTime'='1482771845') Time taken: 0.359 seconds, Fetched: 17 row(s) hive> alter table x set fileformat inputformat "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat" outputformat "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat" serde "org.apache.hadoop.hive.ql.io.orc.OrcSerde"; OK Time taken: 0.429 seconds hive> show create table x; OK CREATE TABLE `x`( `yy` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION 'hdfs://rkk1.hdp.local:8020/apps/hive/warehouse/x' TBLPROPERTIES ( 'last_modified_by'='hive', 'last_modified_time'='1482772074', 'numFiles'='0', 'numRows'='0', 'rawDataSize'='0', 'totalSize'='0', 'transient_lastDdlTime'='1482772074') Time taken: 0.087 seconds, Fetched: 18 row(s) hive>
Created 12-26-2016 06:28 PM
If you look at your statement you will see that group by word and order by word are confused because word is part of the subquery also part of the main query. You need to be explicit on which word you want to order and group by. What was the point to use the alias w, if you are not going to use it?
SELECT w.word, count(1) AS count FROM
(SELECT explode(split(line,'\\s')) AS word FROM docs) w
GROUP BY w.word
ORDER BY w.word;
Created 12-26-2016 06:38 PM
Secondly, I recommend you to create it as an external table if you don't want to lose the file neither to change the data (just for view), otherwise you lose the file because this is a Hive ingest when you create it as an internal Hive table.