Created on 06-15-201601:22 AM - edited 08-17-201912:01 PM
HAWQ is the Greenplum Database/Warehouse engine implemented on top of Hadoop Filesystem. To know GreenplumIt gives an edge to understand how relations are stored within Postgres OR Greenplum database (since Greenplum is based on Postgres, of course with certain optimizations to make it work in parallel and have a master/slave architecture).
Unlike hive, a schema is an object within a database. In HAWQs world, this is the setup.
HAWQ Cluster - Combination of HAWQ Master, Standby Master and Segment Hosts. All these machines are physical
MASTER: Is a postgres master process, linux process, that can share the space with name node depending on the configuration and capacity of the Namenode.
STANDBY: Is a replica of MASTER process which includes the logs and metadata information, transferred via log shipping ensuring near realtime replication. Can share the space with Standby/Secondary Namenode host.
SEGMENT: Also known as compute nodes, are physical hosts which can host one or more processes termed as "instances" of postgres process each allocated with their own storage space, memory, CPU. The Segment instances can share the pace with data nodes, depending on the workload.
Is a logical unit within HAWQ Cluster. One cluster can have more than one database, however, keeping this within a lower limit makes it easier to manage resources and maintenance. The orange container within the cluster is representation of a database. This container segregates the data specific to one database within a cluster.
The container OR resource that is located on master and standby servers contain metadata information which are crucial and used while generating optimized plans to execute a query. There is a logical container located even at segment servers which helps segments segregate the database at the segment level.
Schema & Tables
Schema is a logical unit within each database i.e., Schema is private to a database. A simple example to understand this would be something like
Database is like a school and Schema is like Standards. In real life scenarios, the schema can contain tables specific to a department, for instance, Finance, Marketing, Sales etcetera. This figure below describes the layout of how tables are located, pinned within each schema.
Here is an actual demonstration of how all of these look together when accessing a HAWQ cluster using CLI. PSQL or better "psql" is tool similar to mysql binary to access the HAWQ cluster using command line.
Here is how you can login to the HAWQ server:
[gpadmin@sandbox ~]$ psql -h 172.16.105.136 -p 10432 test
Type "help" for help.
Here, psql is the client. "-h" represents the host on which we have installed the cluster. "-p" represents the port on which master is listening for clients/connections and finally "test" is the name of the database. So in this case, test is our database.
We can now check the schemas within a database, there are some metadata schemas which are present as soon as a database is created, lets look at the schema list for test.
List of schemas
Name | Owner
hawq_toolkit | gpadmin
information_schema | gpadmin
pg_aoseg | gpadmin
pg_bitmapindex | gpadmin
pg_catalog | gpadmin
pg_toast | gpadmin
public | gpadmin
These seven schemas are metadata schemas, should not be used for user objects except public. Creating an schema would create a logical separator for objects which are crucial to maintaining data governance and separation. Here is an example.