Member since: 07-11-2016
Posts: 19
Kudos Received: 12
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1669 | 08-22-2017 04:23 PM
 | 3869 | 04-20-2017 07:24 PM
 | 16473 | 08-31-2016 07:19 PM
08-28-2017
03:11 PM
Ack! No, don't add "trust". The line "host all <user_name> 0.0.0.0/0 trust" means that the user_name you picked can log into the database without a password. Plus, you made the user a superuser, which is a bad idea. Instead, add this to the end of the pg_hba.conf file on the master:

host all all 0.0.0.0/0 md5

This means it will require an encrypted password. When you create your user, use this:

create user <user_name> password 'your_password';

or, if you already created your user:

alter user <user_name> password 'new_password';

And you don't have to restart the cluster:

hawq stop cluster -u

That will update the cluster with the new pg_hba.conf file.
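Put together, a minimal sketch of those steps, assuming you run it as gpadmin on the HAWQ master and that $MASTER_DATA_DIRECTORY points at the master data directory (the user name and passwords below are placeholders):

```bash
# Require an md5-encrypted password for all remote connections
echo "host all all 0.0.0.0/0 md5" >> $MASTER_DATA_DIRECTORY/pg_hba.conf

# Create (or update) the database user -- no SUPERUSER needed
psql -c "CREATE USER my_user PASSWORD 'your_password';"
# psql -c "ALTER USER my_user PASSWORD 'new_password';"   -- if the user already exists

# Reload the configuration without restarting the cluster
hawq stop cluster -u
```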
08-22-2017
04:23 PM
1 Kudo
Typically, the Ambari plug-in used to install and manage HAWQ is tied to the Ambari version. As you pointed out, HAWQ 2.2.0.0 requires Ambari v2.4.2 with HDP v2.5.3. Alternatively, you can install HAWQ from rpm and not use Ambari at all. This requires a great deal more knowledge of how HAWQ works, and HAWQ will not be listed in Ambari at all; the only requirement is the HDP version. You'll also have to manage HAWQ outside of Ambari. https://hdb.docs.pivotal.io/220/hdb/install/install-cli.html
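If you go the CLI route, the rough shape is below. The rpm file name depends on the HDB release you download, and the full procedure (OS users, HDFS directories, hawq-site.xml) is in the linked install-cli doc, so treat this only as a sketch:

```bash
# Install the HAWQ rpm on each host (file name varies by release)
rpm -ivh hawq-2.2.0.0-*.el6.x86_64.rpm

# Initialize and manage the cluster with the hawq utility instead of Ambari
hawq init cluster
hawq state
```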
08-09-2017
07:51 PM
PutSQL is by far the worst option here; it is the wrong answer. Please, don't do that. Hellmer clearly doesn't use HAWQ. I wrote a blog post over a year ago on this subject: the singleton INSERT approach that is common for OLTP databases, in other words "INSERT INTO table_name (columns...) VALUES (...);", is the absolute worst way to load HAWQ. Here is my blog post covering loading data into HAWQ: http://www.pivotalguru.com/?p=1425
08-09-2017
07:18 PM
Is this equivalent to: INSERT INTO <table> (col1, col2, col3) VALUES (?, ?, ?);
08-08-2017
08:18 PM
You definitely do not want to execute an INSERT for every row of data. That will be really, really slow. I suggest you use the gpfdist protocol to load the log files directly into HAWQ. You start a gpfdist process where the files are located, create an external table that uses the gpfdist location, and then execute a single INSERT statement to load all of the data from the files into your HAWQ table.
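For example, a minimal sketch of that flow; the host name, port, directory, file pattern, and table definitions are placeholders you would adjust to your log format:

```bash
# Start gpfdist on the host where the log files live
gpfdist -d /data/logs -p 8081 &

# Create an external table over the files and load everything in one INSERT
psql <<'EOF'
CREATE EXTERNAL TABLE ext_weblogs (log_time timestamp, host text, message text)
LOCATION ('gpfdist://etl-host:8081/*.log')
FORMAT 'TEXT' (DELIMITER '|');

INSERT INTO weblogs SELECT * FROM ext_weblogs;
EOF
```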
05-17-2017
03:14 PM
Make sure you have RANDOM_DISTRIBUTION="true" in the tpcds_variables.sh file.

Make sure memory is set correctly. With 128GB of RAM, you should set the following:
- hawq_rm_memory_limit_perseg = 121. In Ambari, this parameter is set in the "Ambari Memory Usage Limit" textbox.
- hawq_rm_stmt_vseg_memory = '16gb' in the hawq-site.xml file. This is done in the "Custom hawq-site" section in Ambari.
- In sysctl.conf: vm.overcommit_memory = 2 and vm.overcommit_ratio = 100.
- Create a 4GB swap file on each node.

Make sure you have a temp directory for each drive. By default there will only be one, but you should use all of the disks you have available. This is just a comma-delimited list of directories on each node, and you will have to create the directories manually.

Change hawq_rm_nvseg_perquery_perseg_limit from the default of 6 to 8, 10, 12, 14, or 16 to improve performance. This specifies how many virtual segments (vsegs) are allowed to be created when a query is executed. Each vseg consumes resources, so there will be a point of diminishing returns as your disk, memory, or CPU becomes the bottleneck.

The concurrency test runs by default; it is set to 5 users in the tpcds_variables.sh file.
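For the OS-level pieces, a rough sketch of what that looks like on each node (the swap file path is just an example; the HAWQ parameters themselves should be set through Ambari as described above):

```bash
# sysctl settings from above (vm.overcommit_memory is the kernel parameter name)
cat >> /etc/sysctl.conf <<'EOF'
vm.overcommit_memory = 2
vm.overcommit_ratio = 100
EOF
sysctl -p

# Create and enable a 4GB swap file
dd if=/dev/zero of=/swapfile bs=1M count=4096
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
```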
04-20-2017
07:24 PM
1 Kudo
Why did you cherry-pick the queries instead of running all 99? The Databricks queries are not dynamic as the TPC-DS benchmark calls for; instead, they use constant values, which may help them in tuning their queries.

500GB and 1TB are rather small. I guess the goal here is to keep the data in memory? You have enough RAM to keep the largest table in memory, which would definitely skew things toward Spark. If you are only concerned with single-user query performance, there are some configuration changes you can make to get HAWQ to perform better. How did the concurrency tests perform?

I noticed in the Databricks DDL that they don't have a single bigint datatype listed. ss_ticket_number in store_sales, for instance, will need to be a bigint when you move to the 3TB scale. I'm assuming they don't intend to test at scale?

Databricks is using a fork of dsdgen to generate the data. Why is that? What are they changing? They also label it in GitHub as "1.1". Did they fork from the older, deprecated version of TPC-DS? The HAWQ benchmark uses the unmodified TPC-DS toolkit version 2.1.0.

I've seen benchmarks from other vendors misrepresent the execution time. They capture the time reported by the database rather than the time it takes to start the "xxxxxx-shell" program, execute the query, and present the results. I've seen reported execution times of 1-2 seconds while the wall-clock time was at least 10 seconds more than that. You may want to verify Spark is really executing as fast as reported. For the HAWQ benchmark, it is the actual time from start to finish, just as the TPC-DS benchmark calls for.
01-13-2017
01:39 PM
1 Kudo
Some corrections.

1. You shouldn't use the PostgreSQL version of psql with HAWQ. While it may work, use the one distributed with HAWQ. You'll also want other utilities like gpfdist, which are distributed with HAWQ and are not part of PostgreSQL. Instead, use rpm to install the utilities on an edge node:

rpm -i hawq-2.1.1.0-7.el6.x86_64.rpm

2. Do not use pg_ctl with HAWQ. Either use Ambari to restart the HAWQ service or use the "hawq" command in a terminal window; pg_ctl will probably stop being distributed with HAWQ in the future.

hawq stop cluster -u -a

-u means update the config; -a means do it silently.

3. An easier way to allow external connections is to add this to the end of the pg_hba.conf file:

host all all 0.0.0.0/0 md5

This means that all external connections will require an encrypted password to authenticate. This is the database password, not the operating system password.

psql -c "alter user gpadmin password 'secret'"

That will change the gpadmin database password to secret.
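Once the rpm is installed on the edge node and the pg_hba.conf change is in place, you can sanity-check the setup with a remote connection; the master host name below is a placeholder:

```bash
# Connect from the edge node with HAWQ's psql; you will be prompted for the
# database password set with ALTER USER (5432 is the default master port)
psql -h hawq-master.example.com -p 5432 -U gpadmin -d postgres -c "SELECT version();"
```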
09-01-2016
03:32 PM
It is pretty simple to configure: http://hdb.docs.pivotal.io/20/pxf/ConfigurePXF.html#topic_i3f_hvm_ss
09-01-2016
03:00 PM
HAWQ's external tables support secured clusters. But I would always prefer to create a single writable external table and then run a typical "INSERT INTO ext_table SELECT * FROM hawq_table;" because a single bulk statement is so much faster to work with. Speaking of speed, I haven't heard of anyone moving data from HAWQ to Hive. It is always the other way round! HAWQ is much faster and has better SQL support than Hive.
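As a sketch of that approach, one possible route (among others, such as PXF's writable HDFS profiles, which you should check in the docs for your HDB version) is a writable external table that lands delimited text files you can then put into HDFS and define as a Hive external table. The table names, host, port, and file name below are assumptions for illustration:

```bash
# Assumes a gpfdist process is already serving /data/export on etl-host:8081
psql <<'EOF'
-- Writable external table pointing at the gpfdist location
CREATE WRITABLE EXTERNAL TABLE ext_sales_out (LIKE sales)
LOCATION ('gpfdist://etl-host:8081/sales_export.txt')
FORMAT 'TEXT' (DELIMITER '|');

-- Single bulk statement to export the HAWQ table
INSERT INTO ext_sales_out SELECT * FROM sales;
EOF
```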