Hive wordcount
Labels: Apache Hive
Created 11-02-2016 09:00 AM
I am new to Hive and followed the tutorial linked below. Any input will be appreciated.
URL: http://hadooptutorial.info/java-vs-hive/
Input (contents of the file 'docs'):
The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed. This approach takes advantage of data locality – nodes manipulating the data they have access to – to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
Script for Hive from the link:

CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
I have the following questions about the script:
1) The script aborted with an "Invalid postscript" error, so I changed the CREATE TABLE statement as shown below. Please let me know what I am missing, since the author of the original link did not face this error.
CREATE TABLE docs (line STRING) STORED AS TEXTFILE;
2) When I got the "Invalid postscript" error, the file (docs) that I had placed in my HDFS home directory was deleted, and I am not sure why. Do I need to place the file again every time I hit this error?
3) The CREATE statement below produces a table with the storage format shown underneath (how I checked it is sketched right after this list). If I want to change the settings to TextInputFormat, how do I do that?

SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat

CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
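For reference, this is how the storage format above can be inspected (a minimal check; output trimmed to the relevant lines):

DESCRIBE FORMATTED word_counts;
-- SerDe Library:  org.apache.hadoop.hive.ql.io.orc.OrcSerde
-- InputFormat:    org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
-- OutputFormat:   org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat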
Created 12-26-2016 05:12 PM
@vamsi valiveti could you please try this?
hive> create table x(yy string);
OK
Time taken: 2.155 seconds

hive> show create table x;
OK
CREATE TABLE `x`(
  `yy` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://rkk1.hdp.local:8020/apps/hive/warehouse/x'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
  'numFiles'='0',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='0',
  'transient_lastDdlTime'='1482771845')
Time taken: 0.359 seconds, Fetched: 17 row(s)

hive> alter table x set fileformat
      inputformat "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"
      outputformat "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"
      serde "org.apache.hadoop.hive.ql.io.orc.OrcSerde";
OK
Time taken: 0.429 seconds

hive> show create table x;
OK
CREATE TABLE `x`(
  `yy` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://rkk1.hdp.local:8020/apps/hive/warehouse/x'
TBLPROPERTIES (
  'last_modified_by'='hive',
  'last_modified_time'='1482772074',
  'numFiles'='0',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='0',
  'transient_lastDdlTime'='1482772074')
Time taken: 0.087 seconds, Fetched: 18 row(s)
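As a side note, if you want newly created tables to default to text rather than ORC, the default comes from the hive.default.fileformat property (a minimal sketch; I am assuming your sandbox currently defaults to ORC, which I cannot verify from here):

SET hive.default.fileformat=TextFile;   -- per-session override; new tables now default to plain text

-- or be explicit in the DDL, as you already did:
CREATE TABLE docs (line STRING) STORED AS TEXTFILE;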
Created 12-26-2016 06:28 PM
If you look at your statement, you will see that the GROUP BY word and ORDER BY word are ambiguous, because word is part of the subquery as well as of the main query. You need to be explicit about which word you want to group and order by. What was the point of using the alias w if you are not going to use it?
SELECT w.word, count(1) AS count FROM
(SELECT explode(split(line,'\\s')) AS word FROM docs) w
GROUP BY w.word
ORDER BY w.word;
Created 12-26-2016 06:38 PM
Secondly, I recommend you create it as an external table if you don't want to lose the file or have the data changed (just use it for viewing); otherwise you lose the file, because the load is a Hive ingest when you create it as an internal (managed) Hive table.
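A minimal sketch of that approach (the LOCATION path is hypothetical; point it at the HDFS directory that holds your docs file):

CREATE EXTERNAL TABLE docs (line STRING)
STORED AS TEXTFILE
LOCATION '/user/vamsi/docs_dir';   -- hypothetical directory containing the text file

-- Dropping an external table later removes only the metadata; the file stays in HDFS.
-- DROP TABLE docs;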
