Member since: 09-25-2015
Posts: 112
Kudos Received: 88
Solutions: 12

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 7721 | 03-15-2017 03:17 PM
 | 4287 | 02-17-2017 01:03 PM
 | 802 | 01-02-2017 10:47 PM
 | 1282 | 11-16-2016 07:03 PM
 | 565 | 10-07-2016 05:24 PM
06-06-2018
10:48 AM
1 Kudo
Hi @Priyanka. There are a few issues with your file that are causing CSVSerde to not work correctly.

The first issue is that the delimiter for age is in the wrong position; it needs to be moved one position to the right. That solves the problem where the entire row was being loaded into the first field. Here is the changed file (with the first delimiter moved):

"age"|"job""|""marital""|""education""|""default""
"58"|"management""|""married""|""tertiary""|""no""

The second issue is that, starting with the "job" column, there are extra quotes - the delimiters are quoted as well as the data. This will cause quotes to appear in your output; the delimiters should not be quoted. If you change your file to look like this, it will no longer load any quotes into the output:

"age"|"job"|"marital"|"education"|"default"
"58"|"management"|"married"|"tertiary"|"no"

I hope this helps.
01-24-2018
03:28 PM
Hi @Ron Lee. Thanks for this writeup. One question - On a kerberized cluster will the keytabs need to be regenerated if the service account name is changed?
11-04-2017
12:19 PM
Hi @Jeff Watson. You are correct about SAS's use of String datatypes. Good catch! One of my customers also had to deal with this - String datatype conversions can perform very poorly in SAS. With SAS/ACCESS to Hadoop you can set the libname option DBMAX_TEXT (added with the SAS 9.4m1 release) to globally restrict the character length of all columns read into SAS. However, for restricting column size, SAS specifically recommends using the VARCHAR datatype in Hive whenever possible. http://support.sas.com/documentation/cdl/en/acreldb/67473/HTML/default/viewer.htm#n1aqglg4ftdj04n1eyvh2l3367ql.htm

Use Case - Large Table, All Columns of Type String: Table A stored in Hive has 40 columns, all of type String, with 500M rows. By default, SAS/ACCESS converts String to $32K - that is, 32K in length per character column. The math for a table this size yields a 1.2MB row length x 500M rows, which brings the system to a halt - too large to store in LASR or WORK. The following techniques can be used to work around the challenge in SAS, and they all work:

- Use char and varchar in Hive instead of String.
- Set the libname option DBMAX_TEXT to globally restrict the character length of all columns read in.
- In Hive, use "SET TBLPROPERTIES" to add SASFMT formats for SAS on the schema in Hive (see the sketch below).
- Add formatting to SAS code during inbound reads, for example: Sequence Length 8 Informat 10. Format 10.

I hope this helps.
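Here is the sketch referenced in the third bullet - the table and column names are assumptions, and the exact SASFMT property format is described in the SAS/ACCESS documentation linked above:

-- Option 3 sketch: a SASFMT table property hints to SAS/ACCESS that the String
-- column should be read as a bounded character field (names are assumptions).
ALTER TABLE table_a SET TBLPROPERTIES ('SASFMT:customer_name'='CHAR(50)');

-- Option 1 sketch: declare a bounded type in Hive itself instead of String.
ALTER TABLE table_a CHANGE customer_name customer_name VARCHAR(50);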
10-23-2017
03:33 AM
Hi @tanmoy. I would recommend you go the custom route - create UDF jars for Hive. That way you can use them anywhere you use Hive code. HPL/SQL is still not a GA feature, and in my opinion HPL/SQL needs a little more time to "bake" and become more robust.
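To illustrate the custom route, here is a minimal sketch of registering a UDF jar as a permanent Hive function - the jar path, class name, and function name are hypothetical:

-- Register the jar-backed function once; it is then usable from any Hive session.
CREATE FUNCTION my_db.clean_text AS 'com.example.hive.udf.CleanText'
  USING JAR 'hdfs:///user/hive/udfs/my_udfs.jar';

-- Use it like any built-in function:
SELECT my_db.clean_text(description) FROM my_db.products LIMIT 5;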
09-09-2017
03:43 PM
2 Kudos
Hi @Bala Vignesh N V. I know this is an old question, but I have encountered this recently. This answer may help someone else as well... The issue you had is most likely caused by specifying "COLLECTION ITEMS TERMINATED BY ','". When the table is defined like this (with COLLECTION ITEMS TERMINATED BY comma):

-- Create a dummy table to use in the insert query - like an Oracle DUAL table
create table dummy_TBL (col1 int) ;
insert into dummy_TBL (col1) values(1) ;
create table data_TBL (id int, name string, address struct<city:string,State:string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE;
insert into table data_TBL
select 1,
'Bala',
named_struct('city','Tampa','State','FL')
from dummy_TBL limit 1;

The address.state value is NULL:

+--------------+----------------+--------------------------------+--+
| data_tbl.id | data_tbl.name | data_tbl.address |
+--------------+----------------+--------------------------------+--+
| 1 | Bala | {"city":"Tampa","state":null} |
+--------------+----------------+--------------------------------+--+

But when you define the table like this (without COLLECTION ITEMS TERMINATED BY comma):

create table data_TBL (id int, name string, address struct<city:string,State:string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
insert into table data_TBL
select 1,
'Bala',
named_struct('city','Tampa','State','FL')
from dummy_TBL limit 1;

The address.state value is correct:

+--------------+----------------+--------------------------------+--+
| data_tbl.id | data_tbl.name | data_tbl.address |
+--------------+----------------+--------------------------------+--+
| 1 | Bala | {"city":"Tampa","state":"FL"} |
+--------------+----------------+--------------------------------+--+
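A third variant - keeping COLLECTION ITEMS TERMINATED BY but choosing a delimiter different from the field delimiter - should also produce the correct value. This is a sketch I have not re-tested against the original data:

-- The collection delimiter ('|') now differs from the field delimiter (','),
-- so struct members cannot collide with column boundaries in the text file.
create table data_TBL2 (id int, name string, address struct<city:string,State:string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
STORED AS TEXTFILE;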
I hope this helps.
08-09-2017
09:56 AM
1 Kudo
Hi @avinash midatani. As mentioned in that other HCC post - this capability is not in Hive yet. The JIRA tracking the request is found here: https://issues.apache.org/jira/browse/HIVE-10593 The Spark code from @Alexander Bij found in the HCC post accomplishes that functionality - creating the Hive table structure automatically based on parquet file metadata. https://community.hortonworks.com/questions/5833/create-hive-table-to-read-parquet-files-from-parqu.html
08-08-2017
06:48 PM
Also read this HCC post for more information: https://community.hortonworks.com/questions/5833/create-hive-table-to-read-parquet-files-from-parqu.html
08-08-2017
06:45 PM
Hi @avinash midatani. I suspect the "LIKE PARQUET..." syntax is only valid in Impala.
Your CREATE TABLE syntax might have to look more like this (with explicit column definitions and without the "LIKE PARQUET" block):

CREATE EXTERNAL TABLE tbl_test (col1 datatype1, col2 datatype2, ..., coln datatypeN)
STORED AS PARQUET
LOCATION '/test/kpi';

I hope this helps.
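For example, a filled-in version might look like this - the column names and types here are purely hypothetical and must match the schema of the parquet files under /test/kpi:

CREATE EXTERNAL TABLE tbl_test (
  kpi_name  STRING,
  kpi_value DOUBLE,
  kpi_date  STRING)
STORED AS PARQUET
LOCATION '/test/kpi';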
07-19-2017
07:54 PM
Thanks @Wynner!
07-19-2017
07:05 PM
Additional information from @Matt Clarke: There are three files crucial to a new node being able to successfully join an existing cluster: flow.xml.gz, users.xml, and authorizations.xml. All three of these files must match before a node will be allowed to join an existing cluster.

The flow.xml.gz file contains everything you have added while interfacing with the UI. All nodes must have matching flow.xml.gz files in order to join the cluster. All you need to do is copy the flow.xml.gz file from the original cluster node to the new node, make sure ownership is correct, and restart the new node. Normally these files will be handed out by the cluster to any new node that has none of them; however, if Ambari metrics are enabled and a flow.xml.gz does not exist, Ambari generates a flow.xml.gz file that contains only the Ambari reporting task. Because of this, the new node will not match and will be unable to join the cluster. A NiFi cluster will never overwrite an existing flow.xml.gz file on a new node with its own.

Secured NiFi clusters also require that the users.xml and authorizations.xml files match if file-based authorization is used. These two files only come into play when NiFi is secured and using the local file-based authorizer. If secured, the only time a cluster will hand out the users and authorizations XML files is if they don't exist on the new node.

Bottom line... If you add a new NiFi host via Ambari, it will try to join the cluster. If it fails and shuts back down, copy the above files from one of the existing nodes to the new node and restart via Ambari.
06-28-2017
11:17 AM
We encountered a similar issue when upgrading our Ambari from 2.4 to 2.5. Our Kafka brokers would not restart. Here was the error message:

/var/log/kafka/server.log.2017-06-27-19:java.lang.IllegalArgumentException: requirement failed: security.inter.broker.protocol must be a protocol in the configured set of advertised.listeners. The valid options based on currently configured protocols are Set(SASL_PLAINTEXT)

We had specified PLAINTEXTSASL as the SASL protocol in the configuration. To fix this we changed the following configuration in Custom kafka-broker:

security.inter.broker.protocol=SASL_PLAINTEXT
05-19-2017
07:21 PM
2 Kudos
Hi @viswanath kammula. The Hive WITH clause/CTE (Common Table Expression) does not allow that type of nesting in its syntax. The documentation, https://cwiki.apache.org/confluence/display/Hive/Common+Table+Expression, states "The WITH clause is not supported within SubQuery Blocks." What you are trying to do could be accomplished (with slightly different syntax) if you use a permanent view or a temporary table. Also note: at that same link, look at the bottom section titled "CTE in Views, CTAS, and Insert Statements." It shows how to use CTEs (the WITH clause) along with views, and the "View example" is similar to what you want to do.
Here is a code example of a simple CTE/With statement (for the benefit of other people who will read this post):
with date_table as (select distinct year, quarter, month, dayofmonth
from flight_performance_orc_part
where yearmonth = '2016-12' limit 4)
select * from date_table d ; -- Or any other complex query, joins to other tables, etc.
Results:
year, quarter, month, dayofmonth
2016 4 12 1
2016 4 12 2
2016 4 12 3
2016 4 12 4
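As an example of the "CTE in Views" pattern mentioned above, the same CTE can be wrapped in a view - a sketch reusing the table from the example (the view name is an assumption):

CREATE VIEW december_dates AS
WITH date_table AS (
  SELECT DISTINCT year, quarter, month, dayofmonth
  FROM flight_performance_orc_part
  WHERE yearmonth = '2016-12')
SELECT * FROM date_table;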
I hope this helps.
04-11-2017
12:28 PM
Hi @suresh kumar c. It was a long time ago, but yes, I believe it did fix the query. Here is some quick checking to confirm: %0D is a carriage return in this chart: http://www.degraeve.com/reference/urlencoding.php, and this chart, https://en.wikipedia.org/wiki/Escape_sequences_in_C, shows that \r = carriage return = %0D. Whatever your control character is, look it up and make sure it is URL-escaped when you try to drop the partition. I hope this helps.
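As an illustration, the drop ends up looking something like this - the table and partition names are hypothetical, and '%0D' stands in for whatever URL-escaped control character is embedded in your partition value:

-- The partition value ends in a carriage return, so it is referenced in its
-- URL-escaped form.
ALTER TABLE my_table DROP PARTITION (load_date='2017-04-10%0D');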
04-10-2017
07:51 PM
Hi @Ciarán Porter. There is a hotfix required for this issue. In the meantime you can use the beeline cli and send the output to a csv format. A good explanation is found here: https://community.hortonworks.com/questions/25789/how-to-dump-the-output-from-beeline.html
04-10-2017
07:51 PM
Thanks @apappu. I will request the hotfix for the customer.
04-10-2017
07:51 PM
Update: This has been fixed in Ambari 2.5 (released April 4th, 2017). An added bonus is that Ambari 2.5 includes the new Hive View 2.0 - a vastly improved GUI for developing Hive apps. See attached picture:
04-10-2017
07:51 PM
2 Kudos
**** Update: This has been fixed in Ambari 2.5 (released April 4, 2017). See below for more details. ****

This has been observed in Ambari 2.4 using HDP 2.5.0/HDP 2.5.3.4-5. A customer is having the issue and I am able to replicate it. Using the Ambari Hive View, I'm running a simple select query. Using tpcds data makes the problem easy to spot, since the customer table has an auto-numbered row id.

select c_customer_sk, c_customer_id, c_last_name from customer limit 204;

The query runs successfully and the results page in the browser every 50 rows. The first page shows rows 1..50 and the second page shows rows 51..100, but the third page starts with row 102 - row 101 is skipped in the output. The third page shows rows 102..151 and the fourth page shows rows 152..201, but the fifth page starts with row 203 - row 202 is skipped in the output.

The same pattern appears when using "Save Results" -> "Download as CSV" - every 101st row is skipped. When I run the same query in Beeline there are no issues; all rows are displayed. I have attempted to set the following values in Ambari but they don't affect the results:

set hive.cli.print.header=true;
set showHeader false (beeline)
set headerInterval 0

Thanks in advance for the assistance.
Labels:
- Apache Ambari
- Apache Hive
03-15-2017
03:17 PM
There has been an update to the Support web page. Now the specific Begin and End Dates for Minor Releases (2.3, 2.4, 2.5, etc) are readily available. You can see when Full Support and Technical Guidance periods end. It's in the table at the bottom of this page: https://hortonworks.com/agreements/support-services-policy/
03-10-2017
03:36 PM
3 Kudos
Hi @Florian Rabl. There is a very good discussion on internal and external tables - and what happens when you do or don't specify a location - here; give it a look: https://community.hortonworks.com/questions/63991/hive-location.html My personal opinion is that only internal tables should be stored in /apps/hive/warehouse, and that external tables should always be stored in a different location. There is no hard and fast rule - I only favor that convention because it draws a clear line between where internal and external tables are stored, and makes it easy to tell just by looking at the filesystem whether a table is internal or external. Also don't forget that internal and external table data is treated differently when you issue a DROP TABLE statement: HDFS data belonging to an internal table is deleted, while HDFS data belonging to an external table is not. So external table data stored in /apps/hive/warehouse would not be deleted, and ends up being a 'straggler' in the filesystem when you drop other internal tables or databases.
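To make the DROP TABLE behavior concrete, here is a small sketch (the table name and HDFS path are assumptions):

-- External table whose data lives outside /apps/hive/warehouse.
CREATE EXTERNAL TABLE web_logs_ext (
  log_ts   STRING,
  log_line STRING)
STORED AS TEXTFILE
LOCATION '/data/external/web_logs';

-- Dropping it removes only the metastore entry; /data/external/web_logs stays on HDFS.
DROP TABLE web_logs_ext;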
02-17-2017
01:03 PM
1 Kudo
To add to @Artem Ervits's solution - consider saving your complex query in the source RDBMS as a view. Then just call Sqoop specifying the view name instead of a table name. That keeps the code nice and clean.
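For example, the RDBMS-side view might look something like this (purely hypothetical table and column names); Sqoop can then import it with --table sales_extract_v instead of a long --query:

CREATE VIEW sales_extract_v AS
SELECT o.order_id, o.order_date, c.customer_name, o.total_amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= '2017-01-01';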
02-13-2017
04:01 PM
Hi @Greg Frair. Given what @dtraver has mentioned above, you should be able to run hplsql commands from within that directory:

/usr/hdp/current/hive-server2-hive2/bin/hplsql -e "<sql commands>"

or

/usr/hdp/current/hive-server2-hive2/bin/hplsql -f "<script filename>"
02-13-2017
01:09 PM
Good morning @Greg Frair. I have not started up the sandbox, but HPL/SQL should be available there for your use. If you cd to the executable directory you should be able to run hplsql via the command line. You could also add the executable directory to your $PATH variable.

./hplsql -f "query3.hpl"
./hplsql -e "declare ZP char(10); select ZP = min(zipPostal) from customer; print 'Smallest Zip Code is: ' || ZP;"

I have a very simple GitHub repository with sample database backups, flat files, and SQL Server/Hive DDL that may be helpful for you. It includes a PowerPoint showing the results of executing the sample code on SQL Server and Hive, and it goes into more detail on how to set up and use HPL/SQL (changes to the hplsql and hplsql-site.xml files). It can be found here: https://github.com/BillPreachuk/HPLSQL I hope this helps.
01-31-2017
08:40 PM
1 Kudo
@karthick baskaran I am not sure if you have the ability to use a 3rd-party tool, but one of our trusted partners is Syncsort. If you've used the mainframe before you'll know who they are. EBCDIC conversions, copybooks, etc. are features that they excel at and provide in their flagship tool. It's called DMX-h and it would do what you need (in fact it can be your data integration tool for all data, not just mainframe data). http://www.syncsort.com/en/Products/BigData/DMXh
01-26-2017
08:18 PM
Hi @VENKATESH M. I totally agree with @Kashif Khan's answer. Having a second table that you use for cleansing and categorizing data is very valuable. You would have to keep that table up to date, but it is a nice way to do data-driven categorization. That categorization table could look something like this:

Processor                        Processor_Category
------------------------------   ------------------
i210at                           i210
i210-at                          i210
i210 2016                        i210
intel+ethernet+connection+i210   i210
wgi210it+s+ljxt                  i210
...

Anyway... if you need a quick and dirty way to aggregate the exact query you provided, you can do it this way:

SELECT 'i210' AS Processor,
       COUNT(*) AS Total
FROM `processor-types`
WHERE processor LIKE '%i210%'
  AND ddate = '20161009';
Processor Total
--------- ------
i210 42
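And for the data-driven approach using the categorization table above, the rollup could look something like this (a sketch; the lookup table name is an assumption):

SELECT c.Processor_Category AS Processor,
       COUNT(*)             AS Total
FROM `processor-types` p
JOIN processor_category c
  ON p.processor = c.Processor
WHERE p.ddate = '20161009'
GROUP BY c.Processor_Category;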
I hope this helps.
01-26-2017
07:49 PM
Hi @Reddy. This data (the most recently created partitions) is available in the Hive metastore.
Here is a very helpful Hive Metastore ER Diagram: (from @Rajkumar Singh) https://community.hortonworks.com/questions/57036/is-there-a-updated-hive-121-metastore-er-diagram.html
Here is an excellent post on querying the metastore: http://www.openkb.info/2015/04/how-to-list-table-or-partition-location.html

If you have access to the Hive metastore RDBMS, you can directly query the relational tables to get the info you need. Here is a sample MySQL query to get recent partitions:

select DBS.NAME as DB_NAME, TBLS.TBL_NAME, PARTITIONS.PART_NAME, SDS.LOCATION,
from_unixtime(PARTITIONS.CREATE_TIME) as CREATE_TIME_FORMATTED, PARTITIONS.CREATE_TIME
from SDS,TBLS,PARTITIONS,DBS
where PARTITIONS.SD_ID = SDS.SD_ID
and TBLS.TBL_ID=PARTITIONS.TBL_ID
and TBLS.DB_ID=DBS.DB_ID
and TBLS.TBL_NAME='inventory'
order by PARTITIONS.CREATE_TIME DESC, DBS.NAME, TBLS.TBL_NAME
LIMIT 3;
| DB_NAME | TBL_NAME | PART_NAME | LOCATION | CREATE_TIME_FORMATTED | CREATE_TIME |
+-----------------------------+-----------+---------------------+-------------------------------
| tpcds_bin_partitioned_orc_3 | inventory | inv_date_sk=2451760 | <FQDN>:8020/apps/hive/warehouse/tpcds_bin_partitioned_orc_3.db/inventory/inv_date_sk=2451760 | 2016-12-06 11:39:43 | 1481053183 |
| tpcds_bin_partitioned_orc_3 | inventory | inv_date_sk=2451445 | <FQDN>:8020/apps/hive/warehouse/tpcds_bin_partitioned_orc_3.db/inventory/inv_date_sk=2451445 | 2016-12-06 11:39:43 | 1481053183 |
| tpcds_bin_partitioned_orc_3 | inventory | inv_date_sk=2451326 | <FQDN>:8020/apps/hive/warehouse/tpcds_bin_partitioned_orc_3.db/inventory/inv_date_sk=2451326 | 2016-12-06 11:39:43 | 1481053183 |
But again - this solution requires you to have access to the MySQL metastore RDBMS.
You can also access the Hive metastore data from within the cluster - if you use the hive metatool. When you query the metastore using the metatool - you have to use ORM query language. It is clunky and not well-documented.
If I can get this kind of query working using the hive metatool (using ORM QL) I will edit & add to this post. Here is a high-level example of using the Hive metatool.
(from @gopal)

HIVE_CONF_DIR=/etc/hive/conf/conf.server/
hive --service metatool -executeJDOQL "select name from org.apache.hadoop.hive.metastore.model.MDatabase"
hive --service metatool -executeJDOQL "select database.name + '.' + tableName from org.apache.hadoop.hive.metastore.model.MTable"
You can find the ORM data layouts here: https://github.com/apache/hive/blob/master/metastore/src/model/package.jdo

I hope this helps.
01-23-2017
03:11 PM
Here is a very good blog post showing a performance comparison of Hive LLAP and Impala. It uses 10TB of TPC-DS data warehouse data to get a solid apples-to-apples comparison of the two. http://hortonworks.com/blog/apache-hive-vs-apache-impala-query-performance-comparison/
01-17-2017
06:49 PM
Hi @Naren Reddy. @Eugene Koifman is correct - ACID tables are not designed for frequent OLTP-like transactions. They are much better suited to tables whose deletes/updates are applied every 15 minutes or longer.
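For reference, here is a sketch of an ACID table set up for that kind of periodic batch maintenance (the names and bucket count are assumptions):

-- Transactional tables must be stored as ORC; older Hive releases also require
-- bucketing, and the ACID settings (txn manager, compactor) must be enabled.
CREATE TABLE customer_acid (
  customer_id INT,
  status      STRING)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Apply accumulated deletes/updates in periodic batches (every 15+ minutes)
-- rather than as individual OLTP-style transactions:
UPDATE customer_acid SET status = 'inactive' WHERE customer_id = 1001;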
01-16-2017
03:26 PM
Hi @Rohit Sharma. Did you use the 'EXTERNAL' keyword when you created the table? If you don't specify 'EXTERNAL' then it is an internal table and the data will be deleted, regardless of what location you specify for the data...
01-04-2017
07:23 PM
Hi @Bala Vignesh N V. Lester Martin has an excellent Pig script for this kind of work. It is not an external-table solution, but it is a good way to get this done... https://martin.atlassian.net/wiki/pages/viewpage.action?pageId=21299205