Member since: 02-11-2017
Posts: 17
Kudos Received: 0
Solutions: 0
10-26-2017
02:31 AM
Hello, I deployed Hortonworks HDP on a 4-node cluster in order to run some benchmarks comparing tools like Hive and Spark (2.0). Since I started with Hive, I did some research and found that Beeline can be used to query Hive data with Spark, using the command beeline -u "jdbc:hive2://hadoop-1:10001/;transportMode=http;httpPath=cliservice" -n spark --force=true -f tpch_query1.sql. I verified that this actually works, but the performance is surprisingly slower than Hive. Is this a valid comparison between Spark and Hive performance? If not, how can I query the data I have in Hive without losing performance? Another point: I read that Spark uses in-memory processing, following the same logic as tools like Presto, HAWQ, or Cloudera Impala, but when I execute a query using the command above, the processing seems to be done by MapReduce jobs. Can you shed some light on these subjects?
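For reference, Beeline is only a JDBC client: the engine that runs the query is whichever server answers on the port in the connection URL. A hedged sketch of the distinction (port numbers are assumptions; 10001 is typically HiveServer2 in HTTP mode, which explains the MapReduce jobs, while the Spark Thrift Server on HDP commonly listens on 10015/10016 — verify the actual ports in Ambari):

```shell
# Same client, different engines -- the URL decides who executes the SQL.
# HiveServer2 (Hive engine, MapReduce/Tez under the hood):
beeline -u "jdbc:hive2://hadoop-1:10001/;transportMode=http;httpPath=cliservice" \
        -n spark -f tpch_query1.sql

# Spark Thrift Server (Spark SQL engine; 10015 is a common HDP default,
# check "hive.server2.thrift.port" in the Spark config to confirm):
beeline -u "jdbc:hive2://hadoop-1:10015/" -n spark -f tpch_query1.sql
```

Only the second connection would make the run a Spark-versus-Hive comparison.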
Labels:
- Apache Ambari
- Apache Hive
- Apache Spark
10-21-2017
02:08 AM
I have a 4-node cluster with Ambari deployed. A while ago I tried to add PXF and HAWQ; on 3 of the 4 nodes everything installed successfully, but on one node the PXF Agent fails to run because it can't detect Tomcat, which is installed along with the PXF Agent. I already tried to start it manually, but the result is the same. When I try to start PXF it gives the following output, until it ends up failing:

SEVERE: No shutdown port configured. Shut down server through OS signal. Server not shut down.
The stop command failed. Attempting to signal the process to stop through OS signal.
Tomcat stopped.
/var/pxf
Tomcat started.
Checking if tomcat is up and running...
tomcat not responding, re-trying after 1 second (attempt number 1)
tomcat not responding, re-trying after 1 second (attempt number 2)
tomcat not responding, re-trying after 1 second (attempt number 3)
tomcat not responding, re-trying after 1 second (attempt number 4)
tomcat not responding, re-trying after 1 second (attempt number 5)

Can anyone provide some help with this problem?
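A possible diagnostic sketch for this symptom, comparing the failing node against a healthy one (all paths are illustrative and should be checked against a node where PXF starts cleanly; 51200 is the usual PXF default port):

```shell
# "No shutdown port configured" comes from the <Server> element of the
# Tomcat server.xml that ships with PXF -- compare it with a good node:
grep -n '<Server port' /var/pxf/pxf-service/conf/server.xml

# Check whether something else is already bound to the PXF port:
netstat -tlnp | grep 51200

# The real startup failure is usually visible in Tomcat's own log:
tail -n 50 /var/pxf/pxf-service/logs/catalina.out
```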
Labels:
- Apache Ambari
- Apache Hive
09-22-2017
09:45 AM
Hello, I have configured a 4-node cluster and installed Ambari and the required tools. After configuring the basic tools (Hive, Oozie, Spark, ZooKeeper, etc.), I installed PXF and HAWQ with 3 segments. Now I want to query data stored in Hive; I know that I can use HCatalog and avoid creating PXF external tables. What I don't know is how to access HAWQ. I see that it has to be through PostgreSQL, but when I try to access it with psql (logged in as gpadmin) it always gives me this: psql: FATAL: no pg_hba.conf entry for host "[local]", user "gpadmin", database "gpadmin", SSL off. If I'm logged in as another user, the same happens. Can you tell me how to access psql correctly in order to query Hive data? Another thing: I've seen some forums say that if we use HCatalog, only one node is used to query the data. Is this true? If so, I would be obliged to create external tables.
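That FATAL error means the HAWQ master's pg_hba.conf has no rule matching a local gpadmin connection. A hedged fix sketch (the master data-directory path is illustrative — find the real one via the hawq_master_directory setting in Ambari — and the commands should be run as gpadmin on the master node):

```shell
# Allow local connections for gpadmin; "ident" assumes you connect as
# the gpadmin OS user ("trust" is a laxer alternative for test clusters).
echo "local  all  gpadmin  ident" >> /data/hawq/master/pg_hba.conf

# Reload the config without a full restart (-u = reload, -a = no prompt):
hawq stop cluster -u -a

# Then connect to the master (5432 is the usual HAWQ master port):
psql -d gpadmin -U gpadmin
```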
Labels:
- Apache HCatalog
- Apache Hive
07-07-2017
12:17 PM
Hello, I'm trying Pivotal HAWQ with Ambari, and now I'm trying to run some queries over Hive tables with HAWQ. From what I have seen, HAWQ can query Hive tables through HCatalog (https://community.hortonworks.com/articles/43264/hawqhdb-and-hadoop-with-hive-and-hbase.html), so I use the psql tool on the command line to run queries like this: SELECT * FROM hcatalog.hive-db-name.hive-table-name; Previously I ran some queries on Hive to compare results with HAWQ. I was expecting HAWQ to be much faster, but it is much slower; the query response takes much longer than in Hive. The specific query I am trying to run is query 1 from TPC-H, on a Hive table stored as ORC. Hive took 18 seconds; running the query in psql through HCatalog took 6 minutes and 28 seconds. Can someone explain why this is happening?
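One avenue worth comparing against the on-the-fly hcatalog lookup is a persistent PXF external table over the same Hive table, which HAWQ can analyze and plan for like a regular table. A hedged sketch (host, port — 51200 is the usual PXF default — schema, and the abbreviated column list are all illustrative; the full lineitem column list must be spelled out in practice):

```sql
-- Sketch of a PXF external table over a Hive table, as an alternative
-- to hcatalog.<db>.<table>. Profile=Hive reads through the Hive
-- metastore; all names here are placeholders.
CREATE EXTERNAL TABLE lineitem_pxf (
    l_orderkey BIGINT,
    l_suppkey  INTEGER
    -- ... remaining lineitem columns ...
)
LOCATION ('pxf://hawq-master:51200/tpch.lineitem?Profile=Hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```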
Labels:
- Apache Ambari
- Apache HCatalog
- Apache Hive
06-24-2017
08:10 PM
Hello, I'm trying to generate and load data into Hive tables through hive-testbench (https://github.com/hortonworks/hive-testbench), but when I do the first step, ./tpch-build.sh, it gives the following error:

cd target/; mkdir -p lib/; ( jar cvf lib/dbgen.jar tools/ || gjar cvf lib/dbgen.jar tools/ )
/bin/sh: jar: command not found
/bin/sh: gjar: command not found
make: *** [target/lib/dbgen.jar] Error 127

I already tried to download tpch_kit.zip and place it manually, as someone suggested in this post: https://community.hortonworks.com/questions/25826/hive-benchmarking-error-tpcds-kitzip-issue.html. When I check the target/lib directory, dbgen.jar is not present and I don't know why. Does anyone have a suggestion as to why this is happening? Thanks
Labels:
- Apache Hive
06-19-2017
04:53 PM
Hello, I have a 4-node cluster configured with 1 NameNode and 3 DataNodes. I'm performing a TPC-H benchmark and I would like to know how much data you think my cluster can handle without affecting query response times. The nodes have 16 GB of RAM and 8 cores each, and my total disk space available is ~700 GB. Thank you
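As a rough upper bound, HDFS stores every block several times, so raw disk is not usable dataset size. A back-of-the-envelope sketch (assumptions: default replication factor of 3, and keeping about half of the usable space free for temporary/shuffle data during queries):

```shell
# Raw disk across the cluster, divided by the HDFS replication factor,
# gives the usable capacity; halving that leaves benchmark headroom.
RAW_GB=700
REPLICATION=3
USABLE_GB=$((RAW_GB / REPLICATION))
HEADROOM_GB=$((USABLE_GB / 2))
echo "usable: ${USABLE_GB} GB, comfortable dataset size: ~${HEADROOM_GB} GB"
```

So on this hardware, TPC-H scale factors up to roughly 100 GB of raw data would already be pushing the disk budget.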
05-30-2017
07:47 PM
Hello, I'm performing a TPC-H benchmark on Apache Drill. When I try to run query 21 (code below), it gives the error "UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due to either a cartesian join or an inequality join". I tried with all tables in the FROM clause, but the result is the same. Why is this happening? Can someone provide a version of the query that would work and give the same result?

SELECT S_NAME, COUNT(*) AS NUMWAIT
FROM hive.tpch_flat_orc_30.supplier, hive.tpch_flat_orc_30.nation
join hive.tpch_flat_orc_30.LINEITEM L1 on S_SUPPKEY = L1.L_SUPPKEY
join hive.tpch_flat_orc_30.ORDERS on O_ORDERKEY = L1.L_ORDERKEY
where O_ORDERSTATUS = 'F'
AND L1.L_RECEIPTDATE> L1.L_COMMITDATE
AND EXISTS (SELECT *
FROM hive.tpch_flat_orc_30.LINEITEM L2
WHERE L2.L_ORDERKEY = L1.L_ORDERKEY
AND L2.L_SUPPKEY <> L1.L_SUPPKEY)
AND NOT EXISTS (SELECT *
FROM hive.tpch_flat_orc_30.lineitem L3
WHERE L3.L_ORDERKEY = L1.L_ORDERKEY
AND L3.L_SUPPKEY <> L1.L_SUPPKEY
AND L3.L_RECEIPTDATE > L3.L_COMMITDATE)
AND S_NATIONKEY = N_NATIONKEY
AND N_NAME = 'SAUDI ARABIA'
GROUP BY S_NAME
ORDER BY NUMWAIT DESC, S_NAME
LIMIT 100;
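Drill's error usually points at the implicit cross join created by the comma-separated FROM clause: supplier and nation have no join condition attached at plan time (their equality predicate sits far down in the WHERE clause). A possible rewrite sketch that turns every comma join into an explicit JOIN with its equality predicate (same tables and columns as above; if the planner still complains, the inequality predicates inside the EXISTS/NOT EXISTS subqueries are the other likely culprit):

```sql
-- Rewrite sketch: explicit JOINs instead of comma joins, so no
-- table is left without a join condition at planning time.
SELECT S_NAME, COUNT(*) AS NUMWAIT
FROM hive.tpch_flat_orc_30.supplier S
JOIN hive.tpch_flat_orc_30.LINEITEM L1 ON S.S_SUPPKEY = L1.L_SUPPKEY
JOIN hive.tpch_flat_orc_30.ORDERS O ON O.O_ORDERKEY = L1.L_ORDERKEY
JOIN hive.tpch_flat_orc_30.nation N ON S.S_NATIONKEY = N.N_NATIONKEY
WHERE O.O_ORDERSTATUS = 'F'
  AND L1.L_RECEIPTDATE > L1.L_COMMITDATE
  AND N.N_NAME = 'SAUDI ARABIA'
  AND EXISTS (SELECT *
              FROM hive.tpch_flat_orc_30.LINEITEM L2
              WHERE L2.L_ORDERKEY = L1.L_ORDERKEY
                AND L2.L_SUPPKEY <> L1.L_SUPPKEY)
  AND NOT EXISTS (SELECT *
                  FROM hive.tpch_flat_orc_30.LINEITEM L3
                  WHERE L3.L_ORDERKEY = L1.L_ORDERKEY
                    AND L3.L_SUPPKEY <> L1.L_SUPPKEY
                    AND L3.L_RECEIPTDATE > L3.L_COMMITDATE)
GROUP BY S_NAME
ORDER BY NUMWAIT DESC, S_NAME
LIMIT 100;
```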
Labels:
- Apache Ambari
05-18-2017
05:01 PM
Hello, I'm performing a TPC-H benchmark using Drill. The queries are stored in SQL files and executed over a Hive schema (with tables stored as ORC). When I try to run query 21 (code below), Drill gives the error "Error: UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due to either a cartesian join or an inequality join". The same happens with query 22 of TPC-H. Can someone tell me how to fix the query? Query code:

SELECT S_NAME, COUNT(*) AS NUMWAIT
FROM SUPPLIER, LINEITEM L1, ORDERS, NATION
WHERE S_SUPPKEY = L1.L_SUPPKEY
  AND O_ORDERKEY = L1.L_ORDERKEY
  AND O_ORDERSTATUS = 'F'
  AND L1.L_RECEIPTDATE > L1.L_COMMITDATE
  AND EXISTS (SELECT * FROM LINEITEM L2
              WHERE L2.L_ORDERKEY = L1.L_ORDERKEY
                AND L2.L_SUPPKEY <> L1.L_SUPPKEY)
  AND NOT EXISTS (SELECT * FROM LINEITEM L3
                  WHERE L3.L_ORDERKEY = L1.L_ORDERKEY
                    AND L3.L_SUPPKEY <> L1.L_SUPPKEY
                    AND L3.L_RECEIPTDATE > L3.L_COMMITDATE)
  AND S_NATIONKEY = N_NATIONKEY
  AND N_NAME = 'SAUDI ARABIA'
GROUP BY S_NAME
ORDER BY NUMWAIT DESC, S_NAME
LIMIT 100;
Labels:
- Apache Hive
05-10-2017
12:22 AM
I was using hive-testbench (https://github.com/hortonworks/hive-testbench) to generate TPC-H data sets. I started by generating a 10 GB dataset for Hive (./tpch-setup.sh 10). Running SELECT COUNT(*) on the generated Hive table "part" gives a total of 2,000,000 rows. Meanwhile, I decided to download the official TPC-H tool 2.17, generate the 10 GB .tbl files, and build a Hive database from them. For the same data size of 10 GB, the same count query on the newly created table gives a total of 86,586,082. How is this possible? The number of rows should be the same. Can anyone give me an idea of what's going on? Thanks
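The TPC-H spec fixes each base table's cardinality as a linear function of the scale factor, so the counts can be checked arithmetically. A sketch of that check, plus a hypothesis for the second number (assumption: the commonly cited lineitem cardinality of 59,986,052 rows at SF 10; the sum of all eight tables then matches 86,586,082 exactly, suggesting the second count covered all the .tbl files loaded into one table, not just part):

```shell
# "part" is specified as SF x 200,000 rows, so the testbench count of
# 2,000,000 at SF=10 is exactly what the spec prescribes.
SF=10
PART_ROWS=$((SF * 200000))

# Summing every TPC-H table at SF=10 reproduces the second count:
LINEITEM=59986052; ORDERS=15000000; PARTSUPP=8000000
PART=2000000; CUSTOMER=1500000; SUPPLIER=100000
NATION=25; REGION=5
TOTAL=$((LINEITEM + ORDERS + PARTSUPP + PART + CUSTOMER + SUPPLIER + NATION + REGION))
echo "expected part rows: ${PART_ROWS}; all tables combined: ${TOTAL}"
```

If that sum matches, the discrepancy is a loading mistake, not a generator difference.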
05-09-2017
12:37 AM
So although it presents itself as .deflate, it's basically ORC? Spark can query .parquet files; will it also be able to query these files in the deflate format?
05-08-2017
12:40 AM
Hello, I am using hive-testbench (http://blog.moserit.com/benchmarking-hive) to test some queries. By default, using ./tpcds-setup.sh 10, what file format will my Hive tables have (since in HDFS they are listed with a .deflate extension)? I think the best file formats for performance are either ORC or Parquet; how can I generate the tables in those formats? Thanks
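For what it's worth, the .deflate files are the raw text staging data compressed with DEFLATE; the final Hive tables are built from them in a separate step whose format the testbench lets you choose. A hedged sketch (the FORMAT environment variable is an assumption based on reading the setup script — confirm it in tpcds-setup.sh before relying on it):

```shell
# Ask the testbench to create the final tables as ORC rather than
# the default; same idea for parquet if the script supports it.
FORMAT=orc ./tpcds-setup.sh 10
```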
Tags:
- Ambari
- Data Processing
- hortonwork
- parquet
- tez
Labels:
- Apache Ambari
- Apache Tez
05-03-2017
02:17 PM
The current value is 1073741. I tried to decrease the number to see if the number of reducers would rise, but the result is the same.
05-03-2017
11:41 AM
Hello, currently I am using Hortonworks HDP 2.6 to perform a TPC-H benchmark with a 10 GB scale factor, but when I execute query 19 it only gets one reducer, and the query never ends because it is complex. So I tried to force the number of reducers with the following commands: set mapred.reduce.tasks = 6; and set mapreduce.job.reduces = 6;. With this, it would be logical to see 6 reducers used, but it stays the same, with only one reducer. Any ideas why I can't increase the number of reducers?
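One hedged explanation: on HDP 2.6 Hive typically runs on Tez, where mapred.reduce.tasks is ignored and the reducer count is derived from the estimated data size per reducer. A sketch of the knobs that usually matter instead (the values are illustrative, not recommendations):

```sql
-- Confirm the engine first; if it reports "tez", the MapReduce
-- reducer settings above have no effect.
set hive.execution.engine;

-- Smaller bytes-per-reducer => more reducers for the same input.
set hive.exec.reducers.bytes.per.reducer=67108864;

-- Let Tez adjust the reducer count at runtime.
set hive.tez.auto.reducer.parallelism=true;
```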
02-13-2017
11:55 PM
I was able to generate the data (10 GB), but now that I've run some queries, I get no results, except for query 1, which returns 4 rows. When I run the queries on the Hive command line it gives me the output of the MapReduce jobs, but in the end it doesn't return any rows. Can you give me some kind of help?
02-11-2017
02:46 AM
Good evening, I'm trying to perform a TPC-H benchmark on Hive. I downloaded hive-testbench (https://github.com/hortonworks/hive-testbench) from GitHub; after building it (./tpch-build.sh) I try to generate the data (./tpch-setup.sh 10), but it gives an error saying that dbgen.jar doesn't exist (but it does exist):

ls: `/tmp/tpch-generate/10/lineitem': No such file or directory
Generating data at scale factor 10.
Exception in thread "main" java.io.FileNotFoundException: File file:/home/centos/hive-testbench-hive14/tpch-gen/target/lib/dbgen.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:598)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:425)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2042)
at org.notmysock.tpch.GenTable.copyJar(GenTable.java:163)
at org.notmysock.tpch.GenTable.run(GenTable.java:100)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.notmysock.tpch.GenTable.main(GenTable.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
ls: `/tmp/tpch-generate/10/lineitem': No such file or directory

I already tried to generate the data specifying a directory, but the result is the same. Can you give me some kind of help?
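The FileNotFoundException points at the build step rather than the setup step: dbgen.jar is produced by tpch-build.sh, and its jar-packaging step fails silently if the JDK's `jar` tool is not on the PATH. A hedged verification sketch (paths follow the repository layout shown in the stack trace):

```shell
# If `jar` is missing, the build could not have produced dbgen.jar;
# install a full JDK (JRE alone is not enough) and rebuild.
command -v jar || echo "jar tool missing -- install a JDK first"
./tpch-build.sh

# The setup script expects the jar here before it can generate data:
ls tpch-gen/target/lib/dbgen.jar && ./tpch-setup.sh 10
```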
Labels:
- Apache Hive