Member since: 07-11-2016
Posts: 19
Kudos Received: 12
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1036 | 08-22-2017 04:23 PM
 | 2068 | 04-20-2017 07:24 PM
 | 11857 | 08-31-2016 07:19 PM
08-28-2017
03:11 PM
Ack! No, don't add "trust". The line "host all <user_name> 0.0.0.0/0 trust" means that the user_name you picked can log into the database without a password. Plus, you made the user a superuser, which is a bad idea. Instead, add this to the end of the pg_hba.conf file on the master:

host all all 0.0.0.0/0 md5

This means it will use an encrypted password. When you create your user, use this:

create user <user_name> identified by 'your_password';

or, if you already created your user:

alter user <user_name> identified by 'new_password';

And you don't have to restart the cluster:

hawq stop -u

That will update the cluster with the new pg_hba.conf file.
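To verify the change from a remote machine, a quick connection test with psql looks like this (the host, database, and user here are placeholders for your own values):

psql -h <master_host> -p 5432 -d <database> -U <user_name>

It should now prompt for the password instead of letting anyone in.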
08-22-2017
04:23 PM
1 Kudo
Typically, the Ambari plug-in that is used to install and manage HAWQ is tied to the Ambari version. As you pointed out, HAWQ 2.2.0.0 requires Ambari v2.4.2 with HDP v2.5.3. Alternatively, you could install HAWQ via rpm and not use Ambari at all. This requires a great deal more knowledge of how HAWQ works, and HAWQ will not be listed in Ambari at all. The only requirement in that case is the HDP version. You'll also have to manage HAWQ outside of Ambari. https://hdb.docs.pivotal.io/220/hdb/install/install-cli.html
08-09-2017
07:51 PM
The PutSQL approach is by far the worst option here. This is the wrong answer; please don't do that. Hellmer clearly doesn't use HAWQ. I wrote a blog post over a year ago on this subject. The singleton INSERT pattern that is common for OLTP databases, in other words "INSERT INTO table_name (columns...) VALUES (...);", is the absolute worst solution for HAWQ. Here is my blog post covering loading data into HAWQ: http://www.pivotalguru.com/?p=1425
08-09-2017
07:18 PM
Is this equivalent to: INSERT INTO <table> (col1, col2, col3) VALUES (?, ?, ?);
08-08-2017
08:18 PM
You definitely do not want to execute an INSERT for every row of data. That will be really, really slow. I suggest you use the gpfdist protocol to load the log files directly into HAWQ. You start a gpfdist process where the files are located, create an external table that uses the gpfdist location, and then execute a single INSERT statement to load all of the data from the files into your HAWQ table.
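As a rough sketch (the directory, host, port, and table definitions here are placeholders, not from your environment), the whole flow is just three steps:

-- 1. Start gpfdist on the host where the log files live (shell command):
--      gpfdist -d /data/logs -p 8081 &

-- 2. Create a readable external table in HAWQ that points at that gpfdist location:
create external table ext_logs (
    log_time  timestamp,
    hostname  text,
    message   text
)
location ('gpfdist://etl_host:8081/*.log')
format 'text' (delimiter = '|');

-- 3. A single, parallel INSERT loads every file into the target table:
insert into logs select * from ext_logs;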
05-17-2017
03:14 PM
Make sure you have RANDOM_DISTRIBUTION="true" in the tpcds_variables.sh file.

Make sure you have memory set correctly. With 128GB of RAM, you should set the following:

hawq_rm_memory_limit_perseg = 121

In Ambari, this parameter is set in the "Ambari Memory Usage Limit" textbox.

Set hawq_rm_stmt_vseg_memory to '16gb' in the hawq-site.xml file. This is done in the "Custom hawq-site" section in Ambari.

In the sysctl.conf file, set:

vm.overcommit_memory = 2
vm.overcommit_ratio = 100

Create a 4GB swap file on each node.

Make sure you have a temp directory for each drive. By default, there will only be one, but use all of the disks you have available. This is just a comma-delimited list of directories on each node, and you will have to create the directories manually.

Change hawq_rm_nvseg_perquery_perseg_limit from the default of 6 to 8, 10, 12, 14, or 16 to improve performance. This specifies how many virtual segments (vsegs) are allowed to be created when a query is executed. Each vseg consumes resources, so there will be a point of diminishing returns as your disk, memory, or CPU becomes the bottleneck.

The concurrency test runs by default. It is set to 5 users in the tpcds_variables.sh file.
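Once the cluster is back up, a quick sanity check that the resource settings took effect is to read them back from any psql session (just a verification of the parameter names above):

show hawq_rm_memory_limit_perseg;
show hawq_rm_stmt_vseg_memory;
show hawq_rm_nvseg_perquery_perseg_limit;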
04-20-2017
07:24 PM
1 Kudo
Why did you cherry-pick the queries instead of running all 99? The Databricks queries are not dynamic, as the TPC-DS benchmark calls for. Instead, they use constant values, which may help them in tuning their queries.

500GB and 1TB are rather small. I guess the goal here is to keep the data in memory? You have enough RAM to keep the largest table in memory, which would definitely skew things towards Spark. If you are only concerned with single-user query performance, there are some configuration changes you can make to get HAWQ to perform better. How did the concurrency tests perform?

I noticed in the Databricks DDL that they don't have a single bigint datatype listed. ss_ticket_number in store_sales, for instance, will need to be a bigint when you make the scale 3TB. I'm assuming they don't intend to test at scale?

Databricks is using a fork of dsdgen to generate the data. Why is that? What are they changing? They also label it on GitHub as "1.1". Did they fork from the older, deprecated version of TPC-DS? The HAWQ benchmark uses the unmodified TPC-DS toolkit version 2.1.0.

I've seen some benchmarks from other vendors misrepresent the execution time. They will capture the time reported by the database rather than the time it takes to start the "xxxxxx-shell" program, execute the query, and then present the results. I've seen reported execution times of 1-2 seconds while the clock time was at least 10 seconds more than that. You may want to verify that Spark is really executing as fast as reported. For the HAWQ benchmark, it is the actual time it takes from start to finish, just as the TPC-DS benchmark calls for.
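If you want to verify the reported numbers yourself, compare the timing reported inside the shell with the wall-clock time of the whole client invocation. On the HAWQ side, a simple check looks like this (the query file name is just a placeholder):

-- inside psql, \timing reports per-statement elapsed time only:
\timing on
-- from the shell, this captures the full start-to-finish time, client startup included:
--   time psql -f query01.sql > /dev/null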
01-13-2017
01:39 PM
1 Kudo
Some corrections.

1. You shouldn't use the PostgreSQL version of psql with HAWQ. While it may work, you should use the one distributed with HAWQ. You'll also want other utilities like gpfdist, which are distributed with HAWQ and are not part of PostgreSQL. Instead, use rpm to install the utilities on an edge node:

rpm -i hawq-2.1.1.0-7.el6.x86_64.rpm

2. Do not use pg_ctl with HAWQ. You should either use Ambari to restart the HAWQ service or use the "hawq" command in a terminal window. pg_ctl will probably stop being distributed with HAWQ in the future.

hawq stop cluster -u -a

-u means to update the config
-a means to do it silently

3. An easier way to allow external connections is to add this to the end of the pg_hba.conf file:

host all all 0.0.0.0/0 md5

This means that all external connections will require an encrypted password to authenticate. This is the database password, not the operating system password.

psql -c "alter user gpadmin password 'secret'"

That will change the gpadmin password in the database to secret.
09-01-2016
03:32 PM
It is pretty simple to configure: http://hdb.docs.pivotal.io/20/pxf/ConfigurePXF.html#topic_i3f_hvm_ss
09-01-2016
03:00 PM
HAWQ's external table supports secured clusters. But I would always prefer to just create a single external table and then run a typical "INSERT INTO ext_table SELECT * FROM hawq_table;" because it would be so much faster to work with. Speaking of speed, I haven't heard of anyone moving data from HAWQ to Hive. It is always the other way round! HAWQ is so much faster and has better SQL support than Hive.
09-01-2016
02:43 PM
1 Kudo
What is a "Primary key partition"? I've never heard of that before. - HAWQ doesn't support indexes so it doesn't support a Primary Key constraint. - HAWQ does have table partitioning which can be a list or a range of values. Most commonly, a partition will be based on a date or timestamp column and have a range of entire month, quarter, or year. This is similar to Hive partitioning but Hive can't partition on a range of values like HAWQ can. - HAWQ also has table distribution which can be either random or a hash of a column or columns. With HAWQ 2.0, it is recommended to use random distribution. So your question is how to migrate data from HAWQ to Hive. First off, sqoop would be pretty slow. It would be a single process to unload data. I would never recommend using sqoop for something like this. Instead, you should use a Writable External Table in HAWQ that writes, in parallel, directly to HDFS. http://hdb.docs.pivotal.io/20/pxf/PXFExternalTableandAPIReference.html
08-31-2016
07:19 PM
2 Kudos
I figured it out. I had to do the following:

1. Add slf4j-api.jar, slf4j-log4j12.jar, log4j.jar, hive-jdbc.jar, and hadoop-common.jar to my classpath.

2. Create a log4j.properties file with:

log4j.rootLogger=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.conversionPattern=%5p [%t] (%F:%L) - %m%n

3. Pass the path to the log4j.properties file to the JVM:

-Dlog4j.configuration=file:/path/to/log4j.properties

That is a lot of work just to change the logging level. I wish Hive would default to only showing ERROR and FATAL messages instead of showing all WARN and INFO messages.
08-31-2016
04:38 PM
Your example also displays the unwanted log4j WARN messages.
08-31-2016
12:07 AM
I tried adding it as otherSessionConfs and as hiveConfs but neither worked. As hiveConfs, I got this error message: Failed to open new session: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: hive configuration hive.root.logger does not exists.
08-30-2016
10:10 PM
2 Kudos
I am making a connection to Hive in Java with the following:

Connection conn = DriverManager.getConnection(connectionUrl, userName, password);

I then immediately get these INFO messages:

Aug 30, 2016 5:54:53 PM org.apache.hive.jdbc.Utils parseURL
INFO: Supplied authorities: hdb:10000
Aug 30, 2016 5:54:53 PM org.apache.hive.jdbc.Utils parseURL
INFO: Resolved authority: hdb:10000
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
On the command line, these messages go to STDERR, so I could redirect it to /dev/null, but that would also redirect all errors. I would then lose all error messages being sent to STDERR, which is not good. Is there a URL connection option, or something similar, to suppress INFO and WARNING messages when Hive makes a connection?
Labels:
- Apache Hive
07-11-2016
02:56 PM
I also had to fix the permissions in HDFS as the user pxf:

hdfs dfs -chmod -R 777 /kirk

--hive
create table ext_kirk (
ID INT,
NAME VARCHAR(65355),
AGE INT,
ADDRESS CHAR(50),
SALARY decimal,
JOIN_DATE DATE,
PRICE decimal,
a char(3),
b varchar(5),
i varchar(100),
ts timestamp,
dec decimal(32,2),
doub decimal,
cv varchar(65355),
t varchar(15),
c char(7),
v varchar(20),
isactive boolean
)
row format delimited fields terminated by '|'
location '/kirk';

beeline -u jdbc:hive2://hdb:10000 -n gpadmin -d org.apache.hive.jdbc.HiveDriver -e "select * from ext_kirk"

WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://hdb:10000
Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
Driver: Hive JDBC (version 1.2.1000.2.4.2.0-258)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+--------------+----------------+---------------+-----------------------------------------------------+------------------+---------------------+-----------------+-------------+-------------+-------------+------------------------+---------------+----------------+---------------+-------------+-------------+-------------+--------------------+--+
| ext_kirk.id | ext_kirk.name | ext_kirk.age | ext_kirk.address | ext_kirk.salary | ext_kirk.join_date | ext_kirk.price | ext_kirk.a | ext_kirk.b | ext_kirk.i | ext_kirk.ts | ext_kirk.dec | ext_kirk.doub | ext_kirk.cv | ext_kirk.t | ext_kirk.c | ext_kirk.v | ext_kirk.isactive |
+--------------+----------------+---------------+-----------------------------------------------------+------------------+---------------------+-----------------+-------------+-------------+-------------+------------------------+---------------+----------------+---------------+-------------+-------------+-------------+--------------------+--+
| 4 | Mark | 25 | 12 Rich-Mond | 65000 | 2007-12-13 | 12 | 101 | 100 | 00:05:35 | 2016-02-22 08:01:23.0 | 33.33 | 0 | sdfsadf | 08:01:23 | we | varchar | false |
| 5 | David | 27 | Texas | 85000 | 2007-12-13 | 45 | 101 | 100 | 00:12:55 | 2016-02-22 08:01:23.0 | 33.56 | 0 | white space | 08:01:23 | we | varchar | true |
+--------------+----------------+---------------+-----------------------------------------------------+------------------+---------------------+-----------------+-------------+-------------+-------------+------------------------+---------------+----------------+---------------+-------------+-------------+-------------+--------------------+--+
2 rows selected (0.179 seconds)
Beeline version 1.2.1000.2.4.2.0-258 by Apache Hive
Closing: 0: jdbc:hive2://hdb:10000
07-11-2016
02:48 PM
I would not recommend using Sqoop to pull data out of HAWQ. It will work, but it won't scale, so it will be slow when working with any sizable amount of data. The much better way to handle this is to leverage PXF to write data from HAWQ directly to HDFS in parallel. You are also working with SQL, so it will be very easy to transform the data into a format that other tools can consume. Below, I handled the boolean column.

--HAWQ
create table kirk (
ID INT,
NAME TEXT NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR(50),
SALARY REAL,
JOIN_DATE DATE,
PRICE NUMERIC,
a BIT(3),
b BIT VARYING(5),
i interval,
ts timestamp,
dec decimal(32,2),
doub double precision,
cv character varying,
t time,
c char(7),
v varchar(20),
isactive boolean
)
distributed randomly;

INSERT INTO kirk (ID,NAME,AGE,ADDRESS,SALARY,JOIN_DATE, PRICE, a,b, i, ts,dec,doub,cv,t,c,v, isactive) VALUES
(4, 'Mark', 25, '12 Rich-Mond ', 65000.00, '2007-12-13', 12.44, B'101',B'100', '335','2016-02-22 08:01:23.000',33.3333,0.009,'sdfsadf', '08:01:23.000', 'we ', 'varchar', '0'),
(5, 'David', 27, 'Texas', 85000.00 , '2007-12-13', 45.44,B'101',B'100','775', '2016-02-22 08:01:23.000',33.5555, 0.008, 'white space ','08:01:23.000', 'we', 'varchar ', 'true' );

create writable external table ext_kirk
(
ID INT,
NAME TEXT,
AGE INT,
ADDRESS CHAR(50),
SALARY REAL,
JOIN_DATE DATE,
PRICE NUMERIC,
a BIT(3),
b BIT VARYING(5),
i interval,
ts timestamp,
dec decimal(32,2),
doub double precision,
cv character varying,
t time,
c char(7),
v varchar(20),
isactive varchar(5)
)
location ('pxf://hdb:51200/kirk?Profile=HdfsTextSimple')
format 'text' (delimiter = '|')
distributed randomly;

insert into ext_kirk (ID,NAME,AGE,ADDRESS,SALARY,JOIN_DATE, PRICE, a,b, i, ts,dec,doub,cv,t,c,v, isactive)
select ID,NAME,AGE,ADDRESS,SALARY,JOIN_DATE, PRICE, a,b, i, ts,dec,doub,cv,t,c,v, case when isactive then 'TRUE' else 'FALSE' end from kirk;