09-16-2016
07:21 AM
Update Feb 2018: An updated article for HDF 3.1 is available at https://community.hortonworks.com/articles/173816/automate-deployment-of-hdf-31-clusters-using-ambar.html

Summary: Ambari blueprints can be used to automate cluster setup. With Ambari support added in HDF 2.0, the same can be done for HDF clusters running Nifi, Storm and Kafka.
This article shows how you can use the ambari-bootstrap project to easily generate a blueprint and deploy HDF clusters, either single node or multi-node development/demo environments, in 5 easy steps. If you prefer, a script that automates these steps is also provided at the bottom of the article, so you can deploy the cluster in a few commands. Tested with HDF 2.x and 3.0.

There is also a single node HDF 2.1 demo cluster available on AWS as an AMI which can be brought up in 10 minutes. Details here

Prerequisite: a number of freshly installed hosts running CentOS/RHEL 6 or 7 where HDF is to be installed.

Reminder: do not try to install HDF in an env where Ambari or HDP are already installed (e.g. the HDP sandbox or an HDP cluster).

Steps:

1. After choosing a host where you would like Ambari-server to run, first let's prepare the other hosts. Run this on all hosts where Ambari-server will not be running, to run the prerequisite steps, install Ambari-agents and point them at the Ambari-server host:

export ambari_server=<FQDN of host where ambari-server will be installed>; #replace this
export install_ambari_server=false
export ambari_version=2.5.1.0 ##don't use 2.5.2 for HDF, there is a bug
curl -sSL https://raw.githubusercontent.com/seanorama/ambari-bootstrap/master/ambari-bootstrap.sh | sudo -E sh

2. Run the remaining steps on the host where Ambari-server is to be installed. These run the prerequisites and install Ambari-server:

export ambari_password="admin" # customize password
export cluster_name="HDF" # customize cluster name
export ambari_services="ZOOKEEPER NIFI KAFKA STORM LOGSEARCH AMBARI_METRICS AMBARI_INFRA"
export hdf_ambari_mpack_url="http://public-repo-1.hortonworks.com/HDF/centos7/3.x/updates/3.0.0.0/tars/hdf_ambari_mp/hdf-ambari-mpack-3.0.0.0-453.tar.gz" #replace with the mpack url you want to install
export ambari_version=2.5.1.0 ##don't use 2.5.2 for HDF, there is a bug
#install bootstrap
yum install -y git python-argparse
git clone https://github.com/seanorama/ambari-bootstrap.git
#Runs pre-reqs and install ambari-server
export install_ambari_server=true
~/ambari-bootstrap/ambari-bootstrap.sh
3. Install the mpack and restart Ambari so it forgets HDP and recognizes only the HDF stack:

ambari-server install-mpack --mpack=${hdf_ambari_mpack_url} --verbose
ambari-server restart
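After the restart, you can confirm that Ambari now exposes only the HDF stack. On a live Ambari host you would fetch GET /api/v1/stacks (e.g. with curl -u admin:admin); the sketch below parses a trimmed sample response instead, and the exact response shape is an assumption based on Ambari's REST API (a real response carries extra href fields):

```python
import json

# Trimmed sample of what GET http://<ambari-host>:8080/api/v1/stacks returns
# once only HDF is registered (shape assumed from Ambari's REST API)
sample_response = '''
{
  "items" : [
    { "Stacks" : { "stack_name" : "HDF" } }
  ]
}
'''

stacks = [item["Stacks"]["stack_name"] for item in json.loads(sample_response)["items"]]
print("registered stacks:", stacks)

# If HDP still shows up here, the mpack install/restart did not take effect
assert stacks == ["HDF"], "expected only the HDF stack after install-mpack + restart"
```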
At this point, if you wanted, you could use the Ambari install wizard to install HDF instead: just open http://<Ambari host IP>:8080, login and follow the steps in the doc. Otherwise, to proceed with deploying via blueprints, follow the remaining steps.

4. (Optional) Modify any configurations you like for any of the components by creating configuration-custom.json. Below shows how to customize the Nifi dirs, ports, and the user/group the service runs as. Basically, you create sections in the JSON corresponding to the name of the relevant config file and include the property names and desired values. For a complete listing of available Nifi property files and corresponding properties that Ambari recognizes, check the Nifi service code.

cd ~/ambari-bootstrap/deploy/
tee configuration-custom.json > /dev/null << EOF
{
"configurations" : {
"nifi-ambari-config": {
"nifi.security.encrypt.configuration.password": "changemeplease",
"nifi.content.repository.dir.default": "/nifi/content_repository",
"nifi.database.dir": "/nifi/database_repository",
"nifi.flowfile.repository.dir": "/nifi/flowfile_repository",
"nifi.internal.dir": "/nifi",
"nifi.provenance.repository.dir.default": "/nifi/provenance_repository",
"nifi.max_mem": "1g",
"nifi.node.port": "9092",
"nifi.node.protocol.port": "9089",
"nifi.node.ssl.port": "9093"
},
"nifi-env": {
"nifi_user": "mynifiuser",
"nifi_group": "mynifigroup"
}
}
}
EOF
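Before moving on, it is worth sanity-checking configuration-custom.json, since a malformed file or a too-short encrypt password (Ambari requires at least 12 characters for it) only fails later, mid-deployment. A small sketch, not part of the original steps, that validates the JSON and the password length:

```python
import json

# An inline copy of the config created above is used here; on the cluster you
# would instead do: config = json.load(open("configuration-custom.json"))
raw = '''
{
  "configurations" : {
    "nifi-ambari-config": {
      "nifi.security.encrypt.configuration.password": "changemeplease"
    }
  }
}
'''

config = json.loads(raw)  # raises ValueError if the JSON is malformed
password = config["configurations"]["nifi-ambari-config"][
    "nifi.security.encrypt.configuration.password"]

# deployment fails if this password is shorter than 12 characters
assert len(password) >= 12, "nifi.security.encrypt.configuration.password too short"
print("configuration-custom.json OK")
```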
5. If you chose to skip the previous step, run the below to generate a basic configuration-custom.json file. Change the password, but make sure it is at least 12 characters or deployment will fail.

echo '{ "configurations" : { "nifi-ambari-config": { "nifi.security.encrypt.configuration.password": "changemeplease" }}}' > ~/ambari-bootstrap/deploy/configuration-custom.json

Then generate a recommended blueprint and deploy the cluster. Make sure to set host_count to the total number of hosts in your cluster (including the Ambari server):

export host_count=<Number of total nodes>
export ambari_stack_name=HDF
export ambari_stack_version=3.0 #replace this with HDF stack version
export ambari_services="NIFI KAFKA STORM AMBARI_METRICS ZOOKEEPER LOGSEARCH AMBARI_INFRA"
./deploy-recommended-cluster.bash
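While the install runs, progress can also be polled from the command line via Ambari's requests API (GET /api/v1/clusters/<cluster>/requests/<id> with admin credentials). The field names below follow Ambari's REST API, but the response here is a trimmed sample rather than live output:

```python
import json

# Trimmed sample of an Ambari request-status response; on a live cluster you
# would fetch it with something like:
#   curl -su admin:admin http://localhost:8080/api/v1/clusters/HDF/requests/1
sample_response = '''
{
  "Requests" : {
    "request_status" : "IN_PROGRESS",
    "progress_percent" : 42.0
  }
}
'''

req = json.loads(sample_response)["Requests"]
print("install %s (%.0f%% complete)" % (req["request_status"], req["progress_percent"]))
```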
You can now login to Ambari at http://<Ambari host IP>:8080, sit back and watch your HDF cluster get installed!

Notes:

a) This will only install Nifi on a single node of the cluster by default.

b) The Nifi Certificate Authority (CA) component will be installed by default. This means that, if you wanted to, you could enable SSL for Nifi out of the box by including a "nifi-ambari-ssl-config" section in the above configuration-custom.json:

"nifi-ambari-ssl-config": {
"nifi.toolkit.tls.token": "hadoop",
"nifi.node.ssl.isenabled": "true",
"nifi.security.needClientAuth": "true",
"nifi.toolkit.dn.suffix": ", OU=HORTONWORKS",
"nifi.initial.admin.identity": "CN=nifiadmin, OU=HORTONWORKS",
"content":"<property name='Node Identity 1'>CN=node-1.fqdn, OU=HORTONWORKS</property><property name='Node Identity 2'>CN=node-2.fqdn, OU=HORTONWORKS</property><property name='Node Identity 3'>node-3.fqdn, OU=HORTONWORKS</property>"
},
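The "content" value above is just a concatenation of one <property> element per Nifi node, which gets tedious to hand-edit on larger clusters. A small sketch that builds the string from a list of FQDNs (the FQDNs and OU suffix here are placeholders to replace with your own):

```python
# Placeholder node list - replace with the FQDNs of your Nifi nodes
nodes = ["node-1.fqdn", "node-2.fqdn", "node-3.fqdn"]
suffix = ", OU=HORTONWORKS"

# Build one <property> element per node, numbered from 1
content = "".join(
    "<property name='Node Identity %d'>CN=%s%s</property>" % (i, fqdn, suffix)
    for i, fqdn in enumerate(nodes, start=1)
)
print(content)
```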
Make sure to replace node-x.fqdn with the FQDN of each node running Nifi.

c) As part of the install, you can also have an existing Nifi flow deployed by Ambari. First, read in a flow.xml file from an existing Nifi system (you can find this in flow.xml.gz). For example, run the below to read the flow for the Twitter demo into an env var:

twitter_flow=$(curl -L https://gist.githubusercontent.com/abajwa-hw/3a3e2b2d9fb239043a38d204c94e609f/raw)
Then include a "nifi-flow-env" section in the above configuration-custom.json when you run the tee command, to have ambari-bootstrap include the whole flow xml in the generated blueprint:

"nifi-flow-env" : {
"properties_attributes" : { },
"properties" : {
"content" : "${twitter_flow}"
}
}

d) In case you would like to review the generated blueprint before it gets deployed, just set the below variable as well:

export deploy=false

The blueprint will be created under ~/ambari-bootstrap/deploy/tempdir*/blueprint.json

Sample script

A sample script based on this logic is available here. In addition to the steps above, it can also optionally:

- enable installation of Nifi on all nodes of the cluster
- set up Ambari's Postgres DB for Ranger (in case Ranger will be installed post-cluster-install)
- set up a KDC (in case kerberos will be enabled later)

For example, to deploy a single node HDF sandbox, you can just run the below on a freshly installed CentOS 6 VM (don't run this on a sandbox or VM where Ambari is already installed). You can customize the behaviour by exporting environment variables as shown.

#run below as root
export host_count=1;
curl -sSL https://gist.github.com/abajwa-hw/ae4125c5154deac6713cdd25d2b83620/raw | sudo -E sh ;
What next?

Now that your cluster is up, you can explore what Nifi's Ambari integration means: https://community.hortonworks.com/articles/57980/hdf-20-apache-nifi-integration-with-apache-ambarir.html

Next, you can enable SSL for Nifi: https://community.hortonworks.com/articles/58009/hdf-20-enable-ssl-for-apache-nifi-from-ambari.html

Sample blueprint

A sample generated blueprint for a 3 node cluster is provided for reference:

{
"Blueprints": {
"stack_name": "HDF",
"stack_version": "2.0"
},
"host_groups": [
{
"name": "host-group-1",
"components": [
{
"name": "METRICS_MONITOR"
},
{
"name": "SUPERVISOR"
},
{
"name": "LOGSEARCH_LOGFEEDER"
},
{
"name": "NIFI_CA"
},
{
"name": "NIMBUS"
},
{
"name": "DRPC_SERVER"
},
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "STORM_UI_SERVER"
}
]
},
{
"name": "host-group-2",
"components": [
{
"name": "NIFI_MASTER"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "SUPERVISOR"
},
{
"name": "INFRA_SOLR"
},
{
"name": "INFRA_SOLR_CLIENT"
},
{
"name": "LOGSEARCH_LOGFEEDER"
},
{
"name": "LOGSEARCH_SERVER"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "METRICS_GRAFANA"
},
{
"name": "KAFKA_BROKER"
},
{
"name": "ZOOKEEPER_SERVER"
}
]
},
{
"name": "host-group-3",
"components": [
{
"name": "METRICS_MONITOR"
},
{
"name": "SUPERVISOR"
},
{
"name": "LOGSEARCH_LOGFEEDER"
},
{
"name": "METRICS_COLLECTOR"
},
{
"name": "ZOOKEEPER_SERVER"
}
]
}
],
"configurations": [
{
"nifi-ambari-config": {
"nifi.node.protocol.port": "9089",
"nifi.internal.dir": "/nifi",
"nifi.node.port": "9092",
"nifi.provenance.repository.dir.default": "/nifi/provenance_repository",
"nifi.content.repository.dir.default": "/nifi/content_repository",
"nifi.flowfile.repository.dir": "/nifi/flowfile_repository",
"nifi.max_mem": "1g",
"nifi.database.dir": "/nifi/database_repository",
"nifi.node.ssl.port": "9093"
}
},
{
"ams-env": {
"metrics_collector_heapsize": "512"
}
},
{
"ams-hbase-env": {
"hbase_master_heapsize": "512",
"hbase_regionserver_heapsize": "768",
"hbase_master_xmn_size": "192"
}
},
{
"storm-site": {
"metrics.reporter.register": "org.apache.hadoop.metrics2.sink.storm.StormTimelineMetricsReporter"
}
},
{
"nifi-env": {
"nifi_group": "mynifigroup",
"nifi_user": "mynifiuser"
}
},
{
"ams-hbase-site": {
"hbase.regionserver.global.memstore.upperLimit": "0.35",
"hbase.regionserver.global.memstore.lowerLimit": "0.3",
"hbase.tmp.dir": "/var/lib/ambari-metrics-collector/hbase-tmp",
"hbase.hregion.memstore.flush.size": "134217728",
"hfile.block.cache.size": "0.3",
"hbase.rootdir": "file:///var/lib/ambari-metrics-collector/hbase",
"hbase.cluster.distributed": "false",
"phoenix.coprocessor.maxMetaDataCacheSize": "20480000",
"hbase.zookeeper.property.clientPort": "61181"
}
},
{
"logsearch-properties": {}
},
{
"kafka-log4j": {}
},
{
"ams-site": {
"timeline.metrics.service.webapp.address": "localhost:6188",
"timeline.metrics.cluster.aggregate.splitpoints": "kafka.network.SocketServer.IdlePercent.networkProcessor.0.5MinuteRate",
"timeline.metrics.host.aggregate.splitpoints": "kafka.network.SocketServer.IdlePercent.networkProcessor.0.5MinuteRate",
"timeline.metrics.host.aggregator.ttl": "86400",
"timeline.metrics.service.handler.thread.count": "20",
"timeline.metrics.service.watcher.disabled": "false"
}
},
{
"kafka-broker": {
"kafka.metrics.reporters": "org.apache.hadoop.metrics2.sink.kafka.KafkaTimelineMetricsReporter"
}
},
{
"ams-grafana-env": {}
}
]
}
06-10-2016
11:36 PM
Could you provide the github link or upload the template xml before we publish this? Also, it would be good to show what the tweet looks like before/after processing.
06-10-2016
10:47 PM
Upvoted! If the notebook works on the sandbox, please consider including it in https://github.com/hortonworks-gallery/zeppelin-notebooks. This is the set of demo notebooks that get automatically installed when Zeppelin is installed via Ambari.
05-27-2016
06:39 AM
Zeppelin interpreters allow Zeppelin to submit requests to multiple languages or data processing engines (like Scala, PySpark, Shell, Hive, HBase etc). More details are available at https://zeppelin.incubator.apache.org/docs/0.5.6-incubating/manual/interpreters.html

However, before you can use interpreters, in most cases you need to configure them (e.g. point the hive interpreter to the server/port where HiveServer2 is running on your cluster). You can do this manually via the Zeppelin UI by clicking on the Interpreter tab. In some cases, however, you will want to automate/script out this step. Below is some sample code showing how to do this. To run it, on the host where Zeppelin is running, copy/paste the code into a file called zeppelin.py using your favourite editor, and then run it by typing: python zeppelin.py

Sample code:

#!/usr/local/bin/python
def post_request(url, body):
    import json, urllib2
    encoded_body = json.dumps(body)
    req = urllib2.Request(str(url), encoded_body)
    req.get_method = lambda: 'PUT'
    try:
        response = urllib2.urlopen(req, encoded_body).read()
    except urllib2.HTTPError, error:
        print 'Exception: ' + error.read()
        return
    jsonresp = json.loads(response.decode('utf-8'))
    print jsonresp['status']

import json, urllib2

# fetch all interpreter settings from Zeppelin's REST API
zeppelin_int_url = 'http://localhost:9995/api/interpreter/setting/'
data = json.load(urllib2.urlopen(zeppelin_int_url))

# find the hive and psql interpreter bodies
for body in data['body']:
    if body['group'] == 'psql':
        psqlbody = body
    elif body['group'] == 'hive':
        hivebody = body

# point the hive interpreter at HiveServer2 and push the updated config
hivebody['properties']['hive.hiveserver2.url'] = 'jdbc:hive2://localhost:10000'
post_request(zeppelin_int_url + hivebody['id'], hivebody)

# set the psql interpreter's credentials/url and push the updated config
psqlbody['properties']['postgresql.user'] = 'gpadmin'
psqlbody['properties']['postgresql.password'] = 'gpadmin'
psqlbody['properties']['postgresql.url'] = 'jdbc:postgresql://localhost:10432/postgres'
post_request(zeppelin_int_url + psqlbody['id'], psqlbody)
What does the code do?

- First, it connects to the Zeppelin REST API on port 9995 and fetches the configs for the hive and psql interpreters
- Then it updates the hive interpreter configs (by setting the HS2 url to jdbc:hive2://localhost:10000) and posts the updated config back to Zeppelin
- Finally, it updates the psql interpreter configs (by setting the username/password and updating the psql url to jdbc:postgresql://localhost:10432/postgres) and posts the updated config back to Zeppelin

Sample output is shown below. For each request made, it will output OK if the request succeeded.

# python zeppelin.py
OK
OK
Also, if your urls/credentials are different from the above, feel free to modify the python script and re-run it (you can re-run the script as many times as you like). Now if you open your Zeppelin Interpreter tab, you will notice that the Hive and psql interpreters have been updated with the above configs.
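To double-check the update without opening the UI, the same REST endpoint the script writes to can be read back. The sketch below parses a trimmed sample of the /api/interpreter/setting response (the id value is a made-up placeholder); on the Zeppelin host you would fetch the real thing from http://localhost:9995/api/interpreter/setting:

```python
import json

# Trimmed sample of GET http://localhost:9995/api/interpreter/setting
# (the id is a made-up placeholder)
sample_response = '''
{
  "body" : [
    {
      "id" : "some-interpreter-id",
      "group" : "hive",
      "properties" : { "hive.hiveserver2.url" : "jdbc:hive2://localhost:10000" }
    }
  ]
}
'''

# locate the hive interpreter and print the URL it now points at
for body in json.loads(sample_response)["body"]:
    if body["group"] == "hive":
        url = body["properties"]["hive.hiveserver2.url"]
        print("hive interpreter now points at " + url)
```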
05-19-2016
01:29 AM
Deploy HDB (HAWQ) on HDP 2.4.2 and use Zeppelin for Visualization

Goals:

- Install a 4 node cluster running HDP 2.4.2 using Ambari 2.2.2.0 (including Zeppelin and HDB), using Ambari bootstrap via blueprints or the Ambari install wizard
- Configure HAWQ for Zeppelin
- Configure Zeppelin for HAWQ
- Run HAWQ queries via Zeppelin
- Access Hive tables from HAWQ

Notes: HDB managed via Ambari is only supported from Ambari 2.2.2.0 onwards. Do not attempt using older versions of Ambari.

Single node setup:
The same steps can be used to install a single node setup - just make sure to change the HAWQ master port from the default (5432), for example to 10432. A sample script to automate bringing up a single node HDP cluster with HDB/Zeppelin installed is available here.

A prebuilt single node VM for VMWare, built with HDP 2.4.2, Ambari 2.2.2.0, HDB 2.0.0.0 and Zeppelin, is available for download here (size: 4.84 GB). It was set up using the below, so once services are up you can access Zeppelin (on port 9995) and use the existing 'HAWQ' notebook to run HAWQ queries.

Update (July 10th 2016): An official HDB/HDP sandbox has been posted here (SSH login: root/hawq2016)

Install Ambari 2.2.2.0 and HDB/Zeppelin service definitions

Bring up 4 VMs imaged with RHEL/CentOS 6.x (e.g. node1-4 in this case). On the non-ambari nodes (node2-4 in this case), install ambari-agents and point them to the ambari node (node1 in this case). For a single node setup, this step can be skipped.

export ambari_server=node1
curl -sSL https://raw.githubusercontent.com/seanorama/ambari-bootstrap/master/ambari-bootstrap.sh | sudo -E sh

On the Ambari node (node1), install ambari-server:

export install_ambari_server=true
curl -sSL https://raw.githubusercontent.com/seanorama/ambari-bootstrap/master/ambari-bootstrap.sh | sudo -E sh

Install the Zeppelin service definition:

yum install -y git
git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git /var/lib/ambari-server/resources/stacks/HDP/2.4/services/ZEPPELIN
sed -i.bak '/dependencies for all/a \ "ZEPPELIN_MASTER-START": ["NAMENODE-START", "DATANODE-START"],' /var/lib/ambari-server/resources/stacks/HDP/2.4/role_command_order.json

Install the Pivotal service definition and repo per the HDB doc
Create a staging dir:

mkdir /staging
chmod a+rx /staging

Transfer (e.g. by scp) hdb-2.0.0.0-22126.tar.gz and hdb-ambari-plugin-2.0.0-448.tar.gz from Pivotal Network to the /staging dir.

Set up the HDB repo and Ambari service definition:

tar -xvzf /staging/hdb-2.0.0.0-*.tar.gz -C /staging/
tar -xvzf /staging/hdb-ambari-plugin-2.0.0-*.tar.gz -C /staging/
yum install -y httpd
service httpd start
cd /staging/hdb*
./setup_repo.sh
cd /staging/hdb-ambari-plugin*
./setup_repo.sh
yum install -y hdb-ambari-plugin

At this point you should see a local repo up at http://node1/HDB/

Restart Ambari so it now recognizes the Zeppelin, HAWQ and PXF services:

service ambari-server restart
service ambari-agent restart
Confirm the 4 agents were registered and the agent is up:

curl -u admin:admin -H X-Requested-By:ambari http://localhost:8080/api/v1/hosts
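The curl call above returns a JSON listing of the registered hosts; counting the entries confirms that all 4 agents phoned home. The sketch below parses a trimmed sample of that /api/v1/hosts response (field names follow Ambari's REST API; the host names are placeholders):

```python
import json

# Trimmed sample of GET http://localhost:8080/api/v1/hosts (placeholder names)
sample_response = '''
{
  "items" : [
    { "Hosts" : { "host_name" : "node1" } },
    { "Hosts" : { "host_name" : "node2" } },
    { "Hosts" : { "host_name" : "node3" } },
    { "Hosts" : { "host_name" : "node4" } }
  ]
}
'''

hosts = [item["Hosts"]["host_name"] for item in json.loads(sample_response)["items"]]
print("%d agents registered: %s" % (len(hosts), ", ".join(hosts)))
assert len(hosts) == 4, "expected all 4 agents to have registered"
```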
service ambari-agent status

Deploy vanilla HDP + Zeppelin + HDB

Deploy a cluster running the latest HDP including Zeppelin, HAWQ and PXF. You can either:

Option 1: login to the Ambari UI and use the Install Wizard. In this case:

- You will need to set the 'HAWQ System User Password' to any value you like
- Make sure to manually adjust the HDFS settings mentioned in the HDB doc
- Make sure that the port specified in 'HAWQ master port' (by default, 5432) is not in use on the host where you will install the HAWQ master
- If installing on a single node, or in any other scenario where the HAWQ master needs to be installed on a node where a postgres setup already exists (e.g. installing the HAWQ master on the same host as Ambari), you will need to change the master port from the default value (5432)
- On a single node setup, the 'HAWQ standby master' will not be installed
- Refer to the HDB doc for full details

OR

Option 2: generate/deploy a customized blueprint using ambari-bootstrap. Basically, in the below steps we will first provide a list of components to be installed along with a node count. Then we will provide a list of custom configurations (e.g. the HDFS settings that HDB requires), and they will be merged with the configs that the Ambari recommendations engine suggests for your specific environment (e.g. YARN memory settings etc) to create a blueprint tailored for your cluster.

Install bootstrap, decide which services to deploy and provide host_count (for a single node setup, change this to 1):

#install bootstrap
yum install -y python-argparse
cd
git clone https://github.com/seanorama/ambari-bootstrap.git
#decide which services to deploy and set the number of nodes in the cluster
export ambari_services="HDFS MAPREDUCE2 YARN ZOOKEEPER HIVE ZEPPELIN SPARK HAWQ PXF"
export host_count=4
cd ./ambari-bootstrap/deploy/
Create configuration-custom.json with the custom configs to be included in the blueprint. E.g. below we are including the HDFS config customizations for HAWQ. If installing the HAWQ master on a single node (or a host where postgres/Ambari is already running), change hawq_master_address_port from 5432 (e.g. to 10432):

cat << EOF > configuration-custom.json
{
"configurations" : {
"hdfs-site": {
"dfs.allow.truncate": "true",
"dfs.block.access.token.enable": "false",
"dfs.block.local-path-access.user": "gpadmin",
"dfs.client.read.shortcircuit": "true",
"dfs.client.socket-timeout": "300000000",
"dfs.client.use.legacy.blockreader.local": "false",
"dfs.datanode.handler.count": "60",
"dfs.datanode.socket.write.timeout": "7200000",
"dfs.namenode.handler.count": "600",
"dfs.support.append": "true"
},
"hawq-site":{
"hawq_master_address_port":"5432"
},
"hawq-env":{
"hawq_password":"gpadmin"
},
"core-site": {
"ipc.client.connection.maxidletime": "3600000",
"ipc.client.connect.timeout": "300000",
"ipc.server.listen.queue.size": "3300"
}
}
}
EOF
Review the blueprint if you like, otherwise kick off the cluster creation:

#optional - if you want to review the BP before deploying it
#export deploy=false
#./deploy-recommended-cluster.bash
#more temp*/blueprint.json
#generate BP including customizations and start cluster deployment
export deploy=true
./deploy-recommended-cluster.bash
This will kick off the HDP cluster install, including Zeppelin, HAWQ and PXF. You can monitor the install via Ambari at http://node1:8080. Once complete, your cluster of HDP + HDB will be up.

Note on resource management: By default, HAWQ runs in standalone resource management mode. To enable HAWQ to run on YARN:
Ambari > HAWQ > Configs > Settings > Resource Management > change 'Resource Manager' to YARN

Other configs you may want to modify when running in YARN resource management mode relate to the YARN RM and scheduler endpoints, plus the YARN queue to start the HAWQ AM in. Note: there is currently a bug (applicable to deployment via blueprints) where hawq_rm_yarn_address and hawq_rm_yarn_scheduler_address are not auto-populated to the correct settings for your cluster and need to be manually changed. This issue does not apply when using the 'Add service' wizard to add HAWQ.

Troubleshooting:
If you get an error saying "Attempted to create a Blueprint which already exists, blueprint_name=recommended", most likely you already have a blueprint called 'recommended'. Run the below to delete it first and re-try:

curl -u admin:admin -H X-Requested-By:ambari -X DELETE http://localhost:8080/api/v1/blueprints/recommended
Configure HAWQ for Zeppelin

On the HAWQ master node (node3 in this example), run the below to SSH in, connect to HAWQ via the psql shell and create a new DB for Zeppelin to access:

su - gpadmin
source /usr/local/hawq/greenplum_path.sh
export PGPORT=5432
psql -d postgres
create database contoso;
\q
Note: you only need to set PGPORT if the HAWQ master was not installed on the default port (5432). If you specified a different port, you will need to set this accordingly.

Also note: if you forgot the HAWQ master password you set earlier and need to reset it, you can do this in the psql shell:

ALTER ROLE gpadmin WITH PASSWORD 'gpadmin';
On the HAWQ master node, run the below to add the IP of the Zeppelin node to HAWQ's pg_hba.conf. This is done to allow Zeppelin to access HAWQ from a different node. Make sure to replace 172.17.0.2 below with the IP of the host running Zeppelin:

echo "host all all 172.17.0.2/32 trust" >> /data/hawq/master/pg_hba.conf

Either restart HAWQ via Ambari or, to avoid restarting, run the below command as the gpadmin user to reload the HAWQ configs without restarting the components. See the doc for more details on the reload option:

hawq stop cluster --reload

Configure Zeppelin for HAWQ

Open the Zeppelin Interpreter tab, scroll down to the psql section and make the below changes to connect to the contoso DB on the HAWQ master node/port:
postgresql.url = jdbc:postgresql://node3:5432/contoso (change the hostname to your HAWQ master hostname; if installing on a single node, change the port to the updated port, e.g. 10432)

postgresql.password = gpadmin (or whatever you set it to during installation of HAWQ)

Run HAWQ queries via Zeppelin

Create a new note in Zeppelin with the below cells to create/populate a test table and calculate the avg of a subset:

%psql.sql
create table tt (i int);
insert into tt select generate_series(1,1000000);
%psql.sql
select avg(i) from tt where i>5000;

Access Hive tables via HAWQ

It's possible to access Hive tables from HAWQ (via PXF) both from the command line and through Zeppelin. First, let's import some sample data into Hive. On the Hive host, run the below to download the csv and create a Hive table:

#Import sample tables
cd /tmp
wget https://raw.githubusercontent.com/abajwa-hw/security-workshops/master/data/sample_07.csv
beeline -u "jdbc:hive2://localhost:10000/default"
use default;
CREATE TABLE `sample_07` (
`code` string ,
`description` string ,
`total_emp` int ,
`salary` int )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile;
load data local inpath '/tmp/sample_07.csv' into table sample_07;
To access the Hive table from HAWQ via the psql shell, you can run the below (replace node2 with the host where Hive is running):

su - gpadmin
source /usr/local/hawq/greenplum_path.sh
export PGPORT=5432
psql -d postgres
SET hcatalog_enable = true;
SET pxf_service_address TO "node2:51200";
select count(*) from hcatalog.default.sample_07;

Using Zeppelin, you can run similar steps (replace node2 with the host where Hive is running):

%psql.sql
SET hcatalog_enable = true;
SET pxf_service_address TO "node2:51200";

%psql.sql
select description, salary from hcatalog.default.sample_07;

(Optional) Run queries on retail sample data

A retail dataset with sample queries can be found at https://github.com/pivotalsoftware/pivotal-samples
To import the data and create the hive tables, run the below from your host running Hive:

cd /tmp
git clone https://github.com/pivotalsoftware/pivotal-samples.git
cd /tmp/pivotal-samples/sample-data/
sudo -u hdfs ./load_data_to_HDFS.sh
sudo -u hdfs hdfs dfs -chmod -R 777 /retail_demo
hive -f /tmp/pivotal-samples/hive/create_hive_tables.sql
Now you can run the sample queries via Zeppelin:
Top 10 categories in terms of items sold for all time:

%psql.sql
SELECT product_id
, product_category_id
, product_count
, category_rank
FROM (SELECT product_id, product_category_id
, SUM(item_quantity::int) AS product_count
, row_number() OVER (PARTITION BY product_category_id ORDER BY SUM(item_quantity::int) DESC) AS category_rank
FROM hcatalog.retail_demo.order_lineitems_hive
GROUP BY product_id, product_category_id
) AS lineitems
WHERE category_rank <= 10
ORDER BY product_category_id, category_rank
;
Customers' first and last order ID/date:

%psql.sql
SELECT CASE WHEN order_datetime::timestamp < timestamp '2010-10-08' THEN date_trunc('day', order_datetime::timestamp + interval ' 1 week')
ELSE date_trunc('day', order_datetime::timestamp)
END::date AS order_day
, SUM(CASE WHEN order_datetime >= timestamp '2010-01-08' THEN 1 ELSE 0 END) AS num__orders_current
, SUM(CASE WHEN order_datetime < timestamp '2010-01-08' THEN 1 ELSE 0 END) AS num__orders_last_week
FROM hcatalog.retail_demo.order_lineitems_hive
WHERE order_datetime BETWEEN timestamp '2010-10-01' AND timestamp '2010-10-15 23:59:59'
GROUP BY 1
ORDER BY 1
;
;

Troubleshooting

When trying to run HAWQ queries from Zeppelin, if you get a java.lang.NullPointerException:

- Double check the psql interpreter settings as mentioned above and restart the interpreter
- Make sure postgresql.user and postgresql.password are correct (if needed, reset the gpadmin password using the command mentioned above)
- Make sure postgresql.url is referencing the correct hostname and port for the HAWQ master, and a valid database

If running in YARN mode and the HAWQ queries are hanging, it's possible that the HAWQ YARN Application Master does not have sufficient resources. You can confirm this by opening the Resource Manager UI (Ambari > YARN > Quicklinks) and seeing if any HAWQ jobs are sitting in 'Accepted' state (instead of 'Running'). You can also confirm by checking the latest HAWQ master log (under /data/hawq/master/pg_log/hawq*.log): it may show messages like "fail to allocate from YARN RM after retry several times". If there are existing jobs running on YARN (e.g. Spark, Hive etc), you may want to wait for them to complete, or kill them and re-try.
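The 'stuck in Accepted' check can also be scripted against the Resource Manager's REST API (GET http://<rm-host>:8088/ws/v1/cluster/apps?states=ACCEPTED). The field names below follow the YARN RM REST API, but the response is a trimmed, made-up sample rather than live output:

```python
import json

# Trimmed sample of the RM response; on a live cluster you would fetch it with:
#   curl "http://<rm-host>:8088/ws/v1/cluster/apps?states=ACCEPTED"
sample_response = '''
{
  "apps" : {
    "app" : [
      { "name" : "hawq", "state" : "ACCEPTED", "queue" : "default" }
    ]
  }
}
'''

apps = json.loads(sample_response)["apps"]["app"]
stuck = [a for a in apps if a["state"] == "ACCEPTED"]
for a in stuck:
    print("app '%s' is waiting in queue '%s' - HAWQ may lack YARN resources"
          % (a["name"], a["queue"]))
```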
04-04-2016
08:11 PM
Big Data Wrangling on HDP with Trifacta - How to Get Started

Data Preparation is a constant challenge for any enterprise, and the speed, diversity and volumes of data introduced by Big Data amplify this problem substantially. Trifacta with HDP introduces a new approach to organizing, cleansing, enriching and structuring your data - Data Wrangling - where business users are able to connect and engage with the data to drive out high quality data sets for analytics.

Step 1: Download VM

Trifacta would like to provide Hortonworks partners and SIs with an opportunity to test drive Data Wrangling on HDP. Here is a link to a pre-configured virtual machine containing Trifacta Enterprise and HDP 2.3. Feel free to download and try Wrangling today: ftp://download.trifacta.com/Hortonworks/Trifacta_3.0_HDP_2.3.2_sandbox.ova (user: hortonworks, pw: wrangler)

Step 2: Start VM and access consoles

The Trifacta Enterprise Wrangler on HDP is built on HDP 2.3 and a demo/sandbox instance of CentOS. To access the instance:

- When using VMware desktop/fusion, the VM is configured to share with your host (NAT), so the Ambari login is: http://172.16.238.133:8080 - login to Ambari (admin/admin) to ensure your HDP services are running
- NOTE: If using VirtualBox, port forwarding will allow you to access these services on the same ports, but through localhost: http://127.0.0.1:8080
- To access the VM via ssh: ssh root@172.16.238.133 (pw: Wrangler!123)
- To start the Trifacta service, type: service trifacta start
- To access the Trifacta UI: http://172.16.238.133:3005 (user: admin@trifacta.local, pw: admin)

Step 3: Try out the demos

The demo instance comes configured with 11 canned Trifacta demos; the datasets for these are available for use immediately:
- CPG_CrossSell
- IoT_CityBike
- CPG_InventoryPlanning
- Pharmacovigilance_DrugSafety
- ClickStream_WeblogAnalytics
- SIEM_CyberSecurity
- DemoContentOverview.pdf
- SalesDashboard_For_Executives
- FinServ_TraderFraud
- TelcoChurn_4MinuteDemo
- TelcoChurn_Customer360
- Insurance_CrossSell

Data Wrangling allows a business user to discover, register, transform, structure and publish high quality analytic data sets in a matter of minutes. Register on the Trifacta Partner Portal (https://trifacta.channeltivity.com/) for more information on these demos. For more, visit http://www.trifacta.com.
01-19-2016
02:10 AM
Thanks @Scott Shaw...updated
01-19-2016
02:10 AM
Thanks @Artem Ervits! Updated it
12-30-2015
11:12 PM
Awesome stuff!
12-29-2015
09:39 PM
@Andrew Grande: thanks for the info! @azeltov looks like there was a missing class attribute in step 2 where solrconfig.xml is modified. I have added it in and tested that it works:

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">