<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Hive wordcount in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160779#M45086</link>
    <description>&lt;P&gt;Secondly, I recommend creating it as an external table if you don't want to lose the file or have the data moved (you only need it for querying); otherwise you lose the file, because loading into an internal (managed) Hive table is an ingest that moves the file into the Hive warehouse. &lt;/P&gt;</description>
    <pubDate>Tue, 27 Dec 2016 02:38:51 GMT</pubDate>
    <dc:creator>cstanca</dc:creator>
    <dc:date>2016-12-27T02:38:51Z</dc:date>
    <item>
      <title>Hive wordcount</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160776#M45083</link>
      <description>&lt;P&gt;I am new to Hive and followed the link mentioned below. Input will be appreciated.&lt;/P&gt;&lt;P&gt;URL: &lt;A href="http://hadooptutorial.info/java-vs-hive/" target="_blank"&gt;http://hadooptutorial.info/java-vs-hive/&lt;/A&gt; &lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Input:- [docs] &lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code to nodes to process in parallel, based on the data that needs to be processed. This approach takes advantage of data locality – nodes manipulating the data they have access to – allowing the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.&lt;/P&gt;&lt;PRE&gt;Script for Hive from the link:-
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
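-- A hedged alternative (not part of the linked script; the LOCATION path is
-- hypothetical): an EXTERNAL table pointed at a directory leaves the source
-- file in place, whereas LOAD DATA INPATH into a managed table moves the file
-- into the Hive warehouse, which is why the original file "disappears".
CREATE EXTERNAL TABLE docs_ext (line STRING)
STORED AS TEXTFILE
LOCATION '/user/hive/input/docs';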
&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;I have the following clarifications on the script: &lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;1) The script aborted with an "Invalid postscript" error, so I changed the CREATE TABLE statement as shown below. Please let me know what I am missing, since the author of the original link did not face any error.&lt;/P&gt;&lt;P&gt;CREATE TABLE docs (line STRING) STORED AS TEXTFILE; &lt;/P&gt;&lt;P&gt;2) When I got the Invalid postscript error, the [docs] file placed in my HDFS home directory was deleted; I am not sure why. Do I need to place the file again every time I get an Invalid postscript error?&lt;/P&gt;&lt;P&gt;3) The CREATE statement below creates a table with the following format. If I want to change the settings to TextInputFormat, how do I change it?
&lt;/P&gt;&lt;PRE&gt;SerDe Library:      	org.apache.hadoop.hive.ql.io.orc.OrcSerde	 
InputFormat:        	org.apache.hadoop.hive.ql.io.orc.OrcInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
&lt;/PRE&gt;</description>
      <pubDate>Wed, 02 Nov 2016 16:00:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160776#M45083</guid>
      <dc:creator>vamsi123</dc:creator>
      <dc:date>2016-11-02T16:00:53Z</dc:date>
    </item>
    <item>
      <title>Re: Hive wordcount</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160777#M45084</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/9789/vamsivalivetiedu.html" nodeid="9789"&gt;@vamsi valiveti&lt;/A&gt; could you please try this?&lt;/P&gt;&lt;PRE&gt;hive&amp;gt; create table x(yy string);
OK
Time taken: 2.155 seconds
hive&amp;gt; show create table x;
OK
CREATE TABLE `x`(
  `yy` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://rkk1.hdp.local:8020/apps/hive/warehouse/x'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}', 
  'numFiles'='0', 
  'numRows'='0', 
  'rawDataSize'='0', 
  'totalSize'='0', 
  'transient_lastDdlTime'='1482771845')
Time taken: 0.359 seconds, Fetched: 17 row(s)
hive&amp;gt; alter table x set fileformat inputformat "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat" outputformat "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat" serde "org.apache.hadoop.hive.ql.io.orc.OrcSerde";
OK
Time taken: 0.429 seconds
hive&amp;gt; show create table x;
OK
CREATE TABLE `x`(
  `yy` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://rkk1.hdp.local:8020/apps/hive/warehouse/x'
TBLPROPERTIES (
  'last_modified_by'='hive', 
  'last_modified_time'='1482772074', 
  'numFiles'='0', 
  'numRows'='0', 
  'rawDataSize'='0', 
  'totalSize'='0', 
  'transient_lastDdlTime'='1482772074')
Time taken: 0.087 seconds, Fetched: 18 row(s)
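-- A shorthand sketch (not from the original session; syntax assumed from the
-- Hive DDL manual): the same switch can be written with the file-format name,
-- and reverted the same way:
--   ALTER TABLE x SET FILEFORMAT ORC;
--   ALTER TABLE x SET FILEFORMAT TEXTFILE;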
&lt;/PRE&gt;</description>
      <pubDate>Tue, 27 Dec 2016 01:12:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160777#M45084</guid>
      <dc:creator>rajkumar_singh</dc:creator>
      <dc:date>2016-12-27T01:12:43Z</dc:date>
    </item>
    <item>
      <title>Re: Hive wordcount</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160778#M45085</link>
      <description>&lt;P&gt;&lt;A href="https://community.hortonworks.com/users/9789/vamsivalivetiedu.html"&gt;@vamsi valiveti&lt;/A&gt;&lt;/P&gt;&lt;P&gt;If you look at your statement, you will see that GROUP BY &lt;STRONG&gt;word&lt;/STRONG&gt; and ORDER BY &lt;STRONG&gt;word&lt;/STRONG&gt; are confusing, because &lt;STRONG&gt;word&lt;/STRONG&gt; is part of the subquery as well as the main query. You need to be explicit about which word you want to group and order by. What was the point of using the alias &lt;STRONG&gt;w&lt;/STRONG&gt; if you are not going to use it?&lt;/P&gt;&lt;PRE&gt;SELECT w.word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY w.word
ORDER BY w.word;&lt;/PRE&gt;</description>
      <pubDate>Tue, 27 Dec 2016 02:28:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160778#M45085</guid>
      <dc:creator>cstanca</dc:creator>
      <dc:date>2016-12-27T02:28:06Z</dc:date>
    </item>
    <item>
      <title>Re: Hive wordcount</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160779#M45086</link>
      <description>&lt;P&gt;Secondly, I recommend creating it as an external table if you don't want to lose the file or have the data moved (you only need it for querying); otherwise you lose the file, because loading into an internal (managed) Hive table is an ingest that moves the file into the Hive warehouse. &lt;/P&gt;</description>
      <pubDate>Tue, 27 Dec 2016 02:38:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160779#M45086</guid>
      <dc:creator>cstanca</dc:creator>
      <dc:date>2016-12-27T02:38:51Z</dc:date>
    </item>
  </channel>
</rss>

