Member since: 09-26-2014
Posts: 44
Kudos Received: 10
Solutions: 7
My Accepted Solutions
Views | Posted
---|---
4733 | 02-19-2015 03:41 AM
801 | 01-07-2015 01:16 AM
6945 | 12-10-2014 04:59 AM
3800 | 12-08-2014 01:39 PM
3875 | 11-20-2014 08:16 AM
09-14-2018
03:07 AM
Queries via JDBC client. Yes, I tried refresh, but after ~30 minutes all the queries appeared.
09-14-2018
01:15 AM
Hi, I have a brand new installation of CDH 5.15, where all services (MNG, Impala) are green. I executed ~10 queries on Impala and was checking the queries in Cloudera Manager -> Impala Queries. I noticed two issues:
- Some of the queries were in the list while they were in the "running" state.
- After the statements finished, no queries were reported in Impala Queries at all.

The obvious suspect is the time filter: I checked 30m, 1h, 2h, 1d, still NO results. Another obvious suspect is the search filter; the filter is empty, still NO results. I checked the Service Monitor of CM, it is green, so I suppose it collects data. I checked the Impala storage (firehose_impala_storage_bytes); it is 1 GB. The only "warning" is about the memory of the Service Monitor, which is much less than recommended, but this is a new cluster with no workload running, and CM reports that the heap usage is under 1 GB:

The recommended non-Java memory size is 12.0 GiB, 10.5 GiB more than is configured.

What could be the cause of the empty list? Why is Cloudera Manager not collecting the Impala queries? Or maybe it is, but then why is it not showing them to me? Thanks
09-12-2018
10:28 AM
What does the data look like? I think the JSON has to be on a single line (so it can't contain newlines), and you have to have one JSON object per line. At least I had a similar issue when I wanted to load data via an external table, where the JSON contained one big list with many dict elements.
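For illustration, a minimal Hive sketch of the one-object-per-line layout I mean; the table name, columns, and HDFS path are hypothetical, and depending on the Hive version the JsonSerDe may need the hive-hcatalog-core jar on the classpath:

```sql
-- Each line of the data file must be one complete JSON object, e.g.:
--   {"id": 1, "name": "a"}
--   {"id": 2, "name": "b"}
-- A pretty-printed document, or one big JSON list spanning many lines,
-- will not parse.
CREATE EXTERNAL TABLE events_json (
  id   INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/dwh/events_json';
```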
09-12-2018
10:24 AM
You should use ntp or chrony to synchronize the clocks. If one of them is already in use and the clocks are still out of sync, there may be an issue on the network. Regarding the HBase restart, I would do a Stop, then check on all nodes that no HBase process is running, and then Start.
04-08-2015
03:00 AM
Hi, we installed the 64-bit ODBC driver from DataDirect for Impala and tried to establish a connection between SQL Server 2014 (running on Windows Srv 2012 R2) and Cloudera Impala. After setting up the ODBC driver, the test connection was OK. But the linked server is not working: listing tables works, but a simple select statement returns this kind of error:

OLE DB provider "MSDASQL" for linked server "IMPALA" returned message "Unspecified error".
Msg 7311, Level 16, State 2, Line 1
Cannot obtain the schema rowset "DBSCHEMA_COLUMNS" for OLE DB provider "MSDASQL" for linked server "IMPALA". The provider supports the interface, but returns a failure code when it is used.

I also contacted the technical team from Progress Software, but no response yet. Any ideas?
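One workaround that might sidestep this (a sketch, not verified against the DataDirect driver): route the statement through OPENQUERY, so SQL Server sends the query text to Impala as-is instead of composing it from the provider's column metadata. "IMPALA" is the linked server name from the error above; the table and column names are hypothetical:

```sql
-- Pass-through query: the inner statement is executed by Impala itself and
-- SQL Server only consumes the returned rows, avoiding the four-part-name
-- path that requests the DBSCHEMA_COLUMNS rowset. Names are hypothetical.
SELECT *
FROM OPENQUERY(IMPALA, 'SELECT col1, col2 FROM some_table LIMIT 10');
```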
03-27-2015
12:48 AM
Created a case for this issue; hopefully the engineering team will come back with a solution. Tomas
03-23-2015
02:16 AM
Hi, we are trying to download a bulk of data from a CDH cluster via the Windows ODBC Driver for Impala version 2.5.22 to a Windows server. The ODBC driver works well, but the row-dispatching performance is really bad: roughly 3M rows/minute. We checked the possible bottlenecks for this kind of download, but neither the cluster nor the receiving Windows server was under load at all: the CPU is around 5%, the network cards run at 10 Gbit, there is plenty of RAM, and the target disk the data is written to is a RAID-0 SSD with 1 GB/s max throughput, so we don't know which component in the transfer slows down the records. We tried to run in multiple parallel threads, which helped a little (a 50% performance increase), but the overall throughput is still low. We also tried to tweak the transfer batch size in the ODBC driver; it looks like it doesn't affect the performance at all. The setup is CDH 5.3 and Microsoft SQL Server 2014; Impala is attached via a linked server in MS SQL. Any ideas how to increase the transfer speed? Thanks Tomas
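For context, a sketch of one way such a parallel pull can be structured; the bucketing scheme and all names here are illustrative, not our exact setup (fnv_hash is Impala's built-in hash function):

```sql
-- Split the extract into disjoint hash buckets and run each statement from
-- a separate session, so several ODBC result streams are fetched at once.
-- The bucket count (4) is arbitrary.
SELECT * INTO dbo.extract_part0
FROM OPENQUERY(IMPALA, 'SELECT * FROM big_table WHERE abs(fnv_hash(item_id)) % 4 = 0');

SELECT * INTO dbo.extract_part1
FROM OPENQUERY(IMPALA, 'SELECT * FROM big_table WHERE abs(fnv_hash(item_id)) % 4 = 1');

-- ...and likewise for buckets 2 and 3, each on its own connection.
```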
02-19-2015
03:44 AM
I have a simple Pig program with a simple LOAD and STORE/DUMP statement, but it refuses to load the test data file. The path in HDFS is /user/dwh/ and the file is called test.txt. I assume Pig is not aware of the HA setting of my cluster. Any ideas?

Input path does not exist: hdfs://nameservice1/user/dwh/test.txt
02-19-2015
03:41 AM
1 Kudo
I found piggybank.jar in /opt/cloudera/parcels/CDH/lib/pig/. The problem was in fact that when I called REGISTER piggybank.jar, the grunt shell gave me this exception:

grunt> REGISTER piggybank.jar
2015-02-19 12:38:49,841 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-02-19 12:38:49,849 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 101: file 'piggybank.jar' does not exist.

After changing the working directory to the lib path, the REGISTER worked well. Alternatively, use the absolute path:

REGISTER /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar

Tomas
02-19-2015
03:20 AM
Hey guys, does Cloudera pack the Piggybank UDFs into CDH? I tried to find anything called piggybank in the distribution, but was not successful. Can somebody advise me how to add the Piggybank UDFs to an existing Pig installation in CDH? https://cwiki.apache.org/confluence/display/PIG/PiggyBank Thanks. Tomas
Tags: Pig
02-13-2015
03:37 AM
1 Kudo
Hi, I tried to open (LOAD) a Parquet file (a table created by Impala) in Pig. The table has several integer columns, string columns (changed to chararray), and a timestamp column. The problem is in reading the timestamp column; the error we get is this:

parsing: Error during parsing. can't convert optional int96 charging_start_time
Failed to parse: can't convert optional int96 charging_start_time
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:198)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1676)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1409)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:342)
at org.apache.pig.PigServer.executeBatch(PigServer.java:367)
at org.apache.pig.PigServer.executeBatch(PigServer.java:353)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:769)

I tried to change the definition in the LOAD statement to chararray, int, int96, and float, but nothing helped. Any ideas how to overcome this problem?
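One possible workaround, sketched below and not verified on this exact table: let Impala materialize a copy with the timestamp cast to STRING, and point the Pig LOAD at the copy, so Pig never sees the int96 column. The table name and the extra column are illustrative; charging_start_time is the column from the error:

```sql
-- Impala SQL: a STRING column written to Parquet is readable by Pig as
-- chararray, unlike Impala's INT96 timestamp encoding.
CREATE TABLE my_table_for_pig STORED AS PARQUET AS
SELECT
  item_id,  -- example integer column
  CAST(charging_start_time AS STRING) AS charging_start_time
FROM my_table;
```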
01-23-2015
12:55 PM
The only way I can recommend is to unzip the file, add the column, gzip it again, and upload it. Roughly, for one file:

gunzip pagecounts-20090430-230000.gz
awk '{ print $0",pagecounts-20090430-230000.gz" }' pagecounts-20090430-230000 > output
gzip output
hdfs dfs -put output.gz <target directory>

And of course, you have to script it to run for every file.
01-23-2015
12:47 PM
5 Kudos
This warning message keeps appearing in the log during query execution. Even after the INVALIDATE METADATA statement is done, the warning does not disappear. I tried INVALIDATE METADATA xxxx as well as INVALIDATE METADATA (without specifying the table name). It didn't help. Any thoughts?

Backend 1: Block locality metadata for table 'xxxx' may be stale. Consider running "INVALIDATE METADATA xxxx"
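(For completeness, the lighter-weight relative of INVALIDATE METADATA, in case it behaves differently here; a sketch with 'xxxx' standing for the table name, as in the warning. REFRESH reloads the file and block-location metadata for a single table, which is exactly the metadata the warning calls stale.)

```sql
-- Reload file/block metadata for one table; cheaper than a full
-- INVALIDATE METADATA, which discards all cached metadata for the table.
REFRESH xxxx;
```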
01-15-2015
07:36 AM
We have Impala 2.1 with ODBC connector 2.5.20. This connection is linked to MS SQL Server. The problem is that queries with larger result sets (such as thousands of rows) run extremely slowly. Queries such as select count(*) from table run fine, because the query is executed in Impala and then only the one-row result is returned and fetched by MS SQL Server.
Is there any setting or configuration property in the ODBC connector to improve the performance for queries returning larger datasets?
Thanks
Tomas
01-08-2015
02:17 AM
For locking down a Hadoop cluster, the usual recommendation is to set up Sentry and turn on firewalls (remember, during installation and setup the firewalls should be turned off; this is a recommendation from Cloudera). But when it comes to the actual lockdown, meaning the firewall configuration, it is really hard to figure out which communications (protocol/port) should and should not be allowed between the nodes. It would be nice to have a new feature in Cloudera Manager with a wizard for defining the iptables rules. I think we are not alone with this request/suggestion. Tomas
01-08-2015
12:44 AM
Hi Visahl, no idea what the problem would be; in my case the Hive connector simply did not read the metadata correctly. I am sorry, but I cannot help you. Good luck, T
01-07-2015
01:16 AM
This issue (reading large gzip-compressed tables in Impala) was, based on my experience, solved in the Impala 2.1 release (CDH 5.3.1). Cloudera did not confirm this as a bug: when I arranged a conference call with Cloudera support and they tried to investigate where the problem was, they were not able to define the root cause of it. I assume this change helped solve the problem (from the Impala 2.1.0 release notes):

The memory requirement for querying gzip-compressed text is reduced. Now Impala decompresses the data as it is read, rather than reading the entire gzipped file and decompressing it in memory.

But this is not confirmed; after the upgrade, Impala simply did not crash anymore. T
01-07-2015
01:12 AM
More interestingly, this difference disappeared after upgrading to CDH 5.3.1. T.
01-06-2015
06:21 AM
Does anybody have experience with how to process XML data (imported from MSSQL) and how to store and analyze it in Impala? Thanks Tomas
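One possible approach, as a hedged sketch (all table and column names below are hypothetical): land the XML as a plain string column, flatten it with Hive's built-in xpath UDFs into a Parquet table, and query the flat table from Impala, since Impala itself has no XML functions:

```sql
-- Hive SQL: extract fields from an XML string column with the built-in
-- xpath UDFs and write a flat Parquet table that Impala can query directly.
CREATE TABLE orders_flat STORED AS PARQUET AS
SELECT
  xpath_string(xml_doc, '/order/id')     AS order_id,
  xpath_string(xml_doc, '/order/amount') AS amount
FROM raw_xml_orders;
```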
12-10-2014
04:59 AM
During my tests I came to one (maybe incorrect) conclusion. The table is big and partitioned, and maybe Impala just limits the query to a subset of the table. Because if I change the query like this:

create table result as
select * from tmp_ext_item
where item_id in ( 3040607, 5645020, 69772482, 2030547, 1753459, 9972822, 1846553, 6098104, 1874789, 1834370, 1829598, 1779239, 7932306 )

then it runs correctly and returns all items with the specified item_id.
12-08-2014
01:39 PM
I solved the issue with from_utc_timestamp(Create_Time, 'CEST'). Impala assumes that the timestamp value is stored in UTC, so converting to Central European Time with summer daylight saving produces the correct result. As far as I know, there is no way to tell Impala that the current timezone is CEST, so this conversion has to be made in every query.
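Spelled out as a query (a sketch; the table name is hypothetical, the column and timezone are as above):

```sql
-- Impala stores TIMESTAMP values as UTC, so shift them to Central European
-- Summer Time at read time; this must be repeated in every query, since
-- there is no session-level timezone setting (as far as I know).
SELECT from_utc_timestamp(Create_Time, 'CEST') AS create_time_local
FROM my_table;
```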
12-08-2014
04:51 AM
Hi, I am running a simple query where the WHERE condition is a column IN ( ... ) condition and the list contains 13 elements (numbers). The column is of type int. Every time I run the query I get a different result: sometimes 5 rows, sometimes 2 rows, sometimes 10 rows. Of course I checked, ID by ID, that all the elements are in the table... is this a known bug, or am I missing something?

select * from tmp_ext_item where item_id in ( 3040607, 5645020, 69772482, 2030547, 1753459, 9972822, 1846553, 6098104, 1874789, 1834370, 1829598, 1779239, 7932306 )

T.
11-28-2014
02:47 AM
1 Kudo
Hi, some external tables created by Sqoop are not readable in Impala. Even though the current version (2.0) supports the gzip format, accessing (meaning selecting from) these external tables causes a crash of several, sometimes all, Impala daemons. In the ERROR log of the impalad there are only errors related to "Connection refused" and "Cancelled due to unreachable impalad(s)". The cluster automatically recovers from this state, and after a while the impalad instances are up and running again, but the query still does not work. The interesting thing is that this behaviour occurs only on external tables loaded by one specific user. I also tried to set the access permissions on that user's directory to +rwx, but it didn't help. Can anybody help with this, please?
11-26-2014
03:25 PM
1 Kudo
I just installed the latest version of the Hive ODBC driver (2.5.12) and created a linked server in Microsoft SQL Server 2014. I tried a simple select * from table, or even one integer column such as select ID from table, but the queries failed with this:

OLE DB provider "MSDASQL" for linked server "CLOUDERA-HIVE" returned message "[Cloudera][HiveODBC] (35) Error from Hive: error code: '0' error message: 'ExecuteStatement finished with operation state: ERROR_STATE'.".
Msg 7320, Level 16, State 2, Line 4
Cannot execute the query "SELECT "Tbl1002"."row_id" "Col1004" FROM "HIVE"."default"."test_table" "Tbl1002"" against OLE DB provider "MSDASQL" for linked server "CLOUDERA-HIVE".

Any ideas? Thanks, Tomas

PS. I tried to enable Fast SQL Prepare, but it didn't help. I tried to enable Use Async Exec; that didn't help either. Native queries don't work on MS SQL.
11-26-2014
02:17 PM
I have the same issue: the same query returns different dates. In Impala the date is one hour less than in Hive. The table was created in Hive and loaded with data via insert overwrite table in Hive (the table is partitioned). For example, the timestamp 2014-11-18 00:30:00 (the 18th of November) was correctly written to partition 20141118. But when I fetch the table in Impala with the condition day_id (the partition column) = 20141118, I see the value 2014-11-17 23:30:00, so the difference is one hour. If I query the minimum and maximum start_time from one partition of the table in Impala (partition day_id = 20141118), I get this wrong result:

min( start_time ) = 2014-11-17 23:00
max( start_time ) = 2014-11-18 22:59

When I run the same query in Hive, the result is OK:

min( start_time ) = 2014-11-18 00:00
max( start_time ) = 2014-11-18 23:59

Any help?
11-20-2014
08:16 AM
Works great! Simply setting the --class-name overrides the name of the jar file. Thanks!
11-19-2014
12:06 PM
Have you changed anything in the directory or file permissions in /var/run? If yes, you should probably reconfigure YARN to use a NEW directory (for example, if YARN used /data/yarn/nm for the NodeManager, configure a new path such as /data/yarn/nm2). After changing EVERY directory for YARN and restarting the cluster, YARN started, created the new directories, and set the permissions correctly, so now we don't have this kind of permission problem. If you didn't change any permissions in the local file system, then I don't know what the issue is. Try another user: run a Hive job under root/hdfs/yarn or another user, to see whether this is user-related or whether it always fails. T.
11-19-2014
11:50 AM
Hi guys, has anybody tried to rename the output of the sqoop import command? It is always named QueryResult.jar. When we run multiple sqoop import commands in parallel, the YARN applications view in Cloudera Manager does not distinguish between them; every command is named QueryResult.jar. The sqoop import commands look like this:

sqoop import --connect jdbc:sqlserver://OUR.SQL.SERVER.COM --username XXX --query 'select * from XXXXZZZ where Start_Time >= getdate()-7 and $CONDITIONS' -m 6 --split-by Start_Time --as-textfile --fields-terminated-by '\t' --delete-target-dir -z --target-dir /user/sql2hadoop/ext_xxxxzzzz

sqoop import --connect jdbc:sqlserver://OUR.SQL.SERVER.COM --username XXX --query 'select * from XXXXYYY where Start_Time >= getdate()-7 and $CONDITIONS' -m 6 --split-by Start_Time --as-textfile --fields-terminated-by '\t' --delete-target-dir -z --target-dir /user/sql2hadoop/ext_xxxxyyyyyy

I would like to see in YARN that, for example, there are two applications running: Import_XXXZZZ.jar and Import_XXXXYYY.jar. Is there any parameter for setting the application name? Thanks
11-12-2014
07:51 AM
The ownership was created during the launch of the application master; even after setting 777 on the parent directory, the problem did not disappear. We had to reinstall the whole cluster from scratch 😞