<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Hive wordcount in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160779#M45086</link>
    <description>&lt;P&gt;Secondly, I recommend creating it as an external table if you don't want to lose the file or have the data moved (you only need it for querying); otherwise you lose the file, because loading into an internal (managed) Hive table is an ingest that moves the file into the Hive warehouse. &lt;/P&gt;</description>
    <pubDate>Tue, 27 Dec 2016 02:38:51 GMT</pubDate>
    <dc:creator>cstanca</dc:creator>
    <dc:date>2016-12-27T02:38:51Z</dc:date>
    <item>
      <title>Hive wordcount</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160776#M45083</link>
      <description>&lt;P&gt;I am new to Hive and followed the link mentioned below. Input will be appreciated.&lt;/P&gt;&lt;P&gt;URL: &lt;A href="http://hadooptutorial.info/java-vs-hive/" target="_blank"&gt;http://hadooptutorial.info/java-vs-hive/&lt;/A&gt; &lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Input:- [docs] &lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code to nodes to process in parallel, based on the data that needs to be processed. This approach takes advantage of data locality – nodes manipulating the data they have access to – allowing the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.&lt;/P&gt;&lt;PRE&gt;Script for Hive from the link:-
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
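-- A hedged alternative (not part of the linked script; the LOCATION path is
-- hypothetical): an EXTERNAL table pointed at a directory leaves the source
-- file in place, whereas LOAD DATA INPATH into a managed table moves the file
-- into the Hive warehouse, which is why the original file "disappears".
CREATE EXTERNAL TABLE docs_ext (line STRING)
STORED AS TEXTFILE
LOCATION '/user/hive/input/docs';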
&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;I have the following clarifications on the script: &lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;1) The script aborted with an "Invalid postscript" error, so I changed the CREATE TABLE statement as shown below. Please let me know what I am missing, since the author of the original link did not face any error.&lt;/P&gt;&lt;P&gt;CREATE TABLE docs (line STRING) STORED AS TEXTFILE; &lt;/P&gt;&lt;P&gt;2) When I got the Invalid postscript error, the [docs] file placed in my HDFS home directory was deleted; I am not sure why. Do I need to place the file again every time I get an Invalid postscript error?&lt;/P&gt;&lt;P&gt;3) The CREATE statement below creates a table with the following format. If I want to change the settings to TextInputFormat, how do I change it?
&lt;/P&gt;&lt;PRE&gt;SerDe Library:      	org.apache.hadoop.hive.ql.io.orc.OrcSerde	 
InputFormat:        	org.apache.hadoop.hive.ql.io.orc.OrcInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
&lt;/PRE&gt;</description>
      <pubDate>Wed, 02 Nov 2016 16:00:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160776#M45083</guid>
      <dc:creator>vamsi123</dc:creator>
      <dc:date>2016-11-02T16:00:53Z</dc:date>
    </item>
    <item>
      <title>Re: Hive wordcount</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160777#M45084</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/9789/vamsivalivetiedu.html" nodeid="9789"&gt;@vamsi valiveti&lt;/A&gt; could you please try this?&lt;/P&gt;&lt;PRE&gt;hive&amp;gt; create table x(yy string);
OK
Time taken: 2.155 seconds
hive&amp;gt; show create table x;
OK
CREATE TABLE `x`(
  `yy` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://rkk1.hdp.local:8020/apps/hive/warehouse/x'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}', 
  'numFiles'='0', 
  'numRows'='0', 
  'rawDataSize'='0', 
  'totalSize'='0', 
  'transient_lastDdlTime'='1482771845')
Time taken: 0.359 seconds, Fetched: 17 row(s)
hive&amp;gt; alter table x set fileformat inputformat "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat" outputformat "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat" serde "org.apache.hadoop.hive.ql.io.orc.OrcSerde";
OK
Time taken: 0.429 seconds
hive&amp;gt; show create table x;
OK
CREATE TABLE `x`(
  `yy` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://rkk1.hdp.local:8020/apps/hive/warehouse/x'
TBLPROPERTIES (
  'last_modified_by'='hive', 
  'last_modified_time'='1482772074', 
  'numFiles'='0', 
  'numRows'='0', 
  'rawDataSize'='0', 
  'totalSize'='0', 
  'transient_lastDdlTime'='1482772074')
Time taken: 0.087 seconds, Fetched: 18 row(s)
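-- A shorthand sketch (not from the original session; syntax assumed from the
-- Hive DDL manual): the same switch can be written with the file-format name,
-- and reverted the same way:
--   ALTER TABLE x SET FILEFORMAT ORC;
--   ALTER TABLE x SET FILEFORMAT TEXTFILE;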
&lt;/PRE&gt;</description>
      <pubDate>Tue, 27 Dec 2016 01:12:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160777#M45084</guid>
      <dc:creator>rajkumar_singh</dc:creator>
      <dc:date>2016-12-27T01:12:43Z</dc:date>
    </item>
    <item>
      <title>Re: Hive wordcount</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160778#M45085</link>
      <description>&lt;P&gt;&lt;A href="https://community.hortonworks.com/users/9789/vamsivalivetiedu.html"&gt;@vamsi valiveti&lt;/A&gt;&lt;/P&gt;&lt;P&gt;If you look at your statement, you will see that GROUP BY &lt;STRONG&gt;word&lt;/STRONG&gt; and ORDER BY &lt;STRONG&gt;word&lt;/STRONG&gt; are confusing, because &lt;STRONG&gt;word&lt;/STRONG&gt; is part of the subquery as well as the main query. You need to be explicit about which word you want to group and order by. What was the point of using the alias &lt;STRONG&gt;w&lt;/STRONG&gt; if you are not going to use it?&lt;/P&gt;&lt;PRE&gt;SELECT w.word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY w.word
ORDER BY w.word;&lt;/PRE&gt;</description>
      <pubDate>Tue, 27 Dec 2016 02:28:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160778#M45085</guid>
      <dc:creator>cstanca</dc:creator>
      <dc:date>2016-12-27T02:28:06Z</dc:date>
    </item>
    <item>
      <title>Re: Hive wordcount</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160779#M45086</link>
      <description>&lt;P&gt;Secondly, I recommend creating it as an external table if you don't want to lose the file or have the data moved (you only need it for querying); otherwise you lose the file, because loading into an internal (managed) Hive table is an ingest that moves the file into the Hive warehouse. &lt;/P&gt;</description>
      <pubDate>Tue, 27 Dec 2016 02:38:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-wordcount/m-p/160779#M45086</guid>
      <dc:creator>cstanca</dc:creator>
      <dc:date>2016-12-27T02:38:51Z</dc:date>
    </item>
  </channel>
</rss>

