Community Articles

srai1 · ‎06-15-2016

HAWQ is the Greenplum Database/Warehouse engine implemented on top of Hadoop Filesystem. To know GreenplumIt gives an edge to understand how relations are stored within Postgres OR Greenplum database (since Greenplum is based on Postgres, of course with certain optimizations to make it work in parallel and have a master/slave architecture).

Unlike hive, a schema is an object within a database. In HAWQs world, this is the setup.

HAWQ Cluster - Combination of HAWQ Master, Standby Master and Segment Hosts. All these machines are physical

MASTER: Is a postgres master process, linux process, that can share the space with name node depending on the configuration and capacity of the Namenode.

STANDBY: Is a replica of MASTER process which includes the logs and metadata information, transferred via log shipping ensuring near realtime replication. Can share the space with Standby/Secondary Namenode host.

SEGMENT: Also known as compute nodes, are physical hosts which can host one or more processes termed as "instances" of postgres process each allocated with their own storage space, memory, CPU. The Segment instances can share the pace with data nodes, depending on the workload.

Database

Is a logical unit within HAWQ Cluster. One cluster can have more than one database, however, keeping this within a lower limit makes it easier to manage resources and maintenance. The orange container within the cluster is representation of a database. This container segregates the data specific to one database within a cluster.

The container OR resource that is located on master and standby servers contain metadata information which are crucial and used while generating optimized plans to execute a query. There is a logical container located even at segment servers which helps segments segregate the database at the segment level.

Schema & Tables

Schema is a logical unit within each database i.e., Schema is private to a database. A simple example to understand this would be something like Database is like a school and Schema is like Standards. In real life scenarios, the schema can contain tables specific to a department, for instance, Finance, Marketing, Sales etcetera. This figure below describes the layout of how tables are located, pinned within each schema.

Here is an actual demonstration of how all of these look together when accessing a HAWQ cluster using CLI. PSQL or better "psql" is tool similar to mysql binary to access the HAWQ cluster using command line.

Here is how you can login to the HAWQ server:

[gpadmin@sandbox ~]$ psql -h 172.16.105.136 -p 10432 test
psql (8.2.15)
Type "help" for help.
test=#

Here, psql is the client. "-h" represents the host on which we have installed the cluster. "-p" represents the port on which master is listening for clients/connections and finally "test" is the name of the database. So in this case, test is our database.

We can now check the schemas within a database, there are some metadata schemas which are present as soon as a database is created, lets look at the schema list for test.

test=# \dn
       List of schemas
        Name        |  Owner  
--------------------+---------
 hawq_toolkit       | gpadmin
 information_schema | gpadmin
 pg_aoseg           | gpadmin
 pg_bitmapindex     | gpadmin
 pg_catalog         | gpadmin
 pg_toast           | gpadmin
 public             | gpadmin
(7 rows)

test=#

These seven schemas are metadata schemas, should not be used for user objects except public. Creating an schema would create a logical separator for objects which are crucial to maintaining data governance and separation. Here is an example.

test=# create schema finance;
CREATE SCHEMA
test=# \dn
       List of schemas
        Name        |  Owner  
--------------------+---------
 finance            | gpadmin  <<<<<<<<<<<<<<<
 hawq_toolkit       | gpadmin
 information_schema | gpadmin
 pg_aoseg           | gpadmin
 pg_bitmapindex     | gpadmin
 pg_catalog         | gpadmin
 pg_toast           | gpadmin
 public             | gpadmin
(8 rows)

test=#

We create a schema "finance" which is now listed once we type "\dn" short for display namespace. We will now create table within this schema and see how its laid out.

test=# create table finance.testtable (col1 int, col2 int, col3 double precision, col4 text);
CREATE TABLE
test=# \dt finance.testtable
                  List of relations
 Schema  |   Name    | Type  |  Owner  |   Storage   
---------+-----------+-------+---------+-------------
 finance | testtable | table | gpadmin | append only
(1 row)


test=# \d finance.testtable
 Append-Only Table "finance.testtable"
 Column |       Type       | Modifiers 
--------+------------------+-----------
 col1   | integer          | 
 col2   | integer          | 
 col3   | double precision | 
 col4   | text             | 
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Distributed randomly

Cloudera Community

Community Articles

HAWQ Hierarchy basics

Apache Hadoop

Apache Hive