Member since: 07-08-2016
Posts: 46
Kudos Received: 5
Solutions: 2

My Accepted Solutions

Title | Views | Posted
---|---|---
| 1192 | 07-21-2016 08:36 AM
| 1167 | 07-12-2016 11:58 AM
04-09-2019
08:40 AM
Is it possible to treat empty strings (or strings containing only whitespace) as NULL values in an ORC table? Here is an example. I import data from MS SQL to Hive using Sqoop. When I use TEXTFILE, everything works well: all empty strings are treated as NULL if I use the parameter 'serialization.null.format'=''. But this parameter does not work with ORC tables. Is there any equivalent of this parameter for the ORC format?
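For reference, the workaround I am experimenting with is to clean the values during the insert into the ORC table instead of relying on the SerDe property. This is only a sketch; it assumes a Hive version that supports NULLIF (2.3+), and the table and column names are made up:

-- Hypothetical staging (TEXTFILE) and target (ORC) tables
CREATE TABLE target_orc (id INT, name STRING) STORED AS ORC;

-- Convert empty and whitespace-only strings to NULL while loading
INSERT INTO TABLE target_orc
SELECT
  id,
  NULLIF(TRIM(name), '') AS name   -- whitespace-only values become NULL too
FROM staging_text;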
03-11-2019
08:05 AM
Is there any way to use a list or a regular expression in the Hive table property 'serialization.null.format'? I know that I can declare a single string as the NULL marker with this command:

tblproperties('serialization.null.format'='null')

But what if I want Hive to treat two different strings as NULL? For example:

tblproperties('serialization.null.format'=['null', '\\N']);

OR

tblproperties('serialization.null.format'=' *');

Does anyone know if this is possible?
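In case the property really only accepts a single literal string, the fallback I have in mind is to normalize the null markers in the query instead. A rough sketch, with made-up table and column names:

-- Treat both 'null' and '\N' (and blank values) as SQL NULL at read time
SELECT
  CASE
    WHEN col1 IN ('null', '\\N') OR TRIM(col1) = '' THEN NULL
    ELSE col1
  END AS col1
FROM my_table;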
- Tags:
- Data Processing
- Hive
01-28-2019
11:27 AM
Hi. I have a question regarding Hadoop architecture. I have a 10-node cluster and I want to create some kind of sandboxes inside the cluster. What do I mean by a sandbox? A separate slice of the overall cluster resources where business users could create temporary databases and files and run jobs. A simple solution would be creating a technical user for every sandbox, but I don't want to do it that way. Business users have their own accounts, and I want them to run jobs using those accounts. Maybe the picture says more than words: as you can see, some jobs should run in the main cluster space and some should run in a specific sandbox. The question is: how can I achieve this? The kind of isolation I have in mind is sketched below. Does anyone have an idea?
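This is only a rough sketch of the direction I am considering, not a working setup: a dedicated HDFS area with a space quota plus a dedicated YARN queue restricted by ACLs. All paths, group names, and quota values below are made up.

# Dedicated HDFS area for the sandbox, owned by the business users' group
hdfs dfs -mkdir -p /sandbox/teamA
hdfs dfs -chown -R :teamA /sandbox/teamA
hdfs dfs -chmod 770 /sandbox/teamA

# Cap how much space the sandbox may consume (e.g. 1 TB)
hdfs dfsadmin -setSpaceQuota 1t /sandbox/teamA

On the YARN side, a dedicated Capacity Scheduler queue with submit ACLs limited to that group would keep the sandbox jobs off the main queues while users keep their own accounts.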
06-20-2018
02:39 PM
Hi. I have a problem with exporting a Hive table to an Oracle database. I want to encrypt and hide the password using a jceks keystore. I read a great article about using jceks while importing data with Sqoop: Storing Protected Passwords in Sqoop. It works great when I import data from Oracle to Hive. The problem is that when I try to export data from Hive to Oracle I get the error: Unable to process alias. This is the Sqoop command I am trying to run:

sqoop export \
-Dhadoop.security.credential.provider.path=jceks://hdfs/user/hdfs/pass-enc.jceks \
--connect jdbc:oracle:thin:@1.1.1.1:2222:SID \
--table hive_temp_table_orc \
--username orc_user \
--password-alias oracle.password \
--hcatalog-database default \
--hcatalog-table hive_temp_table \
--hive-partition-key col1 \
--hive-partition-value 2011-01-01

My question is: is it possible to use jceks and the --password-alias parameter with the Sqoop export command, or is this an option only for imports?
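For completeness, the credential alias itself was created roughly like this (the keystore path matches the one in the command above; treat this as a sketch rather than the exact command I ran):

# Store the Oracle password under the alias referenced by --password-alias
hadoop credential create oracle.password \
  -provider jceks://hdfs/user/hdfs/pass-enc.jceks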
02-13-2018
08:37 AM
You should log in to your machine as the root user and install the nc package. I don't know which OS you are using, but if it is CentOS you should execute this command (on every host machine):

yum install nc
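A quick way to confirm the tool is in place afterwards; the hostname and port below are just placeholders:

# Verify nc is installed and that a given port is reachable
which nc
nc -zv node1.example.com 8020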
01-09-2018
11:14 AM
@Lav Jain Thanks for the answer. I checked my pxf-private.classpath. The problem was that I had installed PXF on all data nodes, but no Hive clients (and therefore no Hive jars) were installed there. I copied the jars to all machines and it works now. But there is another problem: I cannot query tables stored in formats other than text file. If I execute the query

select count(*) from hcatalog.default.sales_info;

and the table sales_info is a text file, I get the correct result:

count
-------
7540
(1 row)

But if the table format is ORC, I get this error:

ERROR: pxfwritable_import_beginscan function was not found (nodeExternalscan.c:310) (seg0 Node4:40000 pid=42109) (dispatcher.c:1805)

Does anyone know how to query ORC data? I also still have a problem with external tables. When I create a table:

CREATE EXTERNAL TABLE pxf_hdfs_textsimple(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://Node1:51200/tmp/pxf_test/pxf_hdfs_simple.txt?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (delimiter=E',');

I am not able to query it. There is no real error message, I only see:

ERROR: remote component error (0):(libchurl.c:897)

I checked the HAWQ segment logs, the HAWQ master log, and the Catalina and PXF service logs, but I cannot find anything about this error. I would appreciate it if anyone could help me.
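For the ORC case, what I plan to try next is pointing an external table at the Hive table through a Hive-specific PXF profile instead of the hcatalog integration. This is only a sketch and assumes a HiveORC profile is available in this PXF build:

-- Hypothetical external table over the ORC-backed Hive table
CREATE EXTERNAL TABLE pxf_sales_info_orc(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://Node1:51200/default.sales_info?PROFILE=HiveORC')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');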
01-08-2018
09:25 AM
I have a problem with PXF and external tables in HAWQ. I have seven data nodes in the cluster (and seven HAWQ segments) and I installed PXF on each of them. The layout looks like this:

Node1 - NameNode, Hive Metastore, HAWQ Master, Hive Client, HCat Client, PXF
Node2 - SNameNode, DataNode, HAWQ Segment, PXF
Node3-7 - DataNode, HAWQ Segment, PXF
Node8-9 - Hive Client, HCat Client

I created a table based on the "Test PXF" section of that site, but I have a problem accessing the data. When I try to run a simple query:

SELECT * FROM pxf_hdfs_textsimple;

I get the error:

ERROR: remote component error (0): (libchurl.c:897)
That's all. In the HAWQ master node log I see this:

2018-01-04 18:43:06.902998 CET,"hawq","postgres",p19781,th-319940160,"[local]",,2018-01-04 18:43:06 CET,16768,con26,,seg-10000,,,x16768,sx1,"LOG","00000","no master mirroring standby configuration found",,,,,,,0,,"cdblink.c",159,
2018-01-04 18:43:10.759145 CET,"hawq","poligon",p19820,th-319940160,"[local]",,2018-01-04 18:43:10 CET,0,con27,,seg-10000,,,,,"LOG","00000","getLocalTmpDirFromMasterConfig session_id:27 tmpdir:/tmp",,,,,,,0,,"postinit.c",470,
2018-01-04 18:43:20.600046 CET,"hawq","poligon",p19820,th-319940160,"[local]",,2018-01-04 18:43:10 CET,16773,con27,cmd7,seg-10000,,,x16773,sx1,"ERROR","XX000","remote component error (0): (libchurl.c:897)",,,,,,"select * from pxf_hdfs_textsimple ;",0,,"libchurl.c",897,"Stack trace:
1 0x8c165e postgres errstart (??:?)
2 0x8c34fb postgres elog_finish (??:?)
3 0x5124d6 postgres check_response_code (??:?)
4 0x512686 postgres churl_read_check_connectivity (??:?)
5 0x517b22 postgres <symbol not found> (pxfutils.c:?)
6 0x517d66 postgres call_rest (??:?)
7 0x5168c0 postgres <symbol not found> (pxfmasterapi.c:?)
8 0x516f97 postgres get_data_fragment_list (??:?)
9 0x512ff5 postgres map_hddata_2gp_segments (??:?)
10 0x73f8a2 postgres <symbol not found> (createplan.c:?)
11 0x73fdb5 postgres <symbol not found> (createplan.c:?)
12 0x741dec postgres create_plan (??:?)
13 0x74d1a6 postgres <symbol not found> (planner.c:?)
14 0x74eb3c postgres subquery_planner (??:?)
15 0x74f177 postgres <symbol not found> (planner.c:?)
16 0x74f72e postgres planner (??:?)
17 0x7e496a postgres pg_plan_queries (??:?)
18 0x7e4e05 postgres <symbol not found> (postgres.c:?)
19 0x7e6560 postgres PostgresMain (??:?)
20 0x799860 postgres <symbol not found> (postmaster.c:?)
21 0x79c5e9 postgres PostmasterMain (??:?)
22 0x4a2dff postgres main (??:?)
23 0x7fdfe9e2ab35 libc.so.6 __libc_start_main (??:0)
24 0x4a2e7c postgres <symbol not found> (??:?)
" When I try access data in Hive I get error too. Query: select count(*) from hcatalog.default.aaa; Error: ERROR: remote component error (500) from '127.0.0.1:51200': type Exception report message java.lang.Exception: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/MetaException description The server encountered an internal error that prevented it from fulfilling this request. exception javax.servlet.ServletException: java.lang.Exception: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/MetaException (libchurl.c:897)
LINE 1: select count(*) from hcatalog.default.aaa;
Does anyone know what I am doing wrong? What can cause a problem with accessing HDFS and Hive data through PXF?
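One basic check I am going to run is whether the PXF REST service responds on each node; something along these lines (the endpoint name is what I found in the PXF docs, so treat it as an assumption):

# Check that the PXF agent answers on its default port on every node
curl -s "http://Node1:51200/pxf/ProtocolVersion"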
10-19-2017
08:00 AM
@mqureshi But the problem is that I don't have a subscription, so I don't have access to SmartSense. What should I do in that case?
10-18-2017
03:41 PM
Hi. I have a question regarding the hdp-configuration-utils script and Ambari recommendations. I installed 8 NodeManagers on my cluster. Node hardware spec: 4 cores, 15 GB RAM, 4 disks. I executed the hdp-configuration-utils.py script and got output like this:

Using cores=4 memory=15GB disks=4 hbase=False
Profile: cores=4 memory=14336MB reserved=1GB usableMem=14GB disks=4
Num Container=8
Container Ram=1792MB
Used Ram=14GB
Unused Ram=1GB
***** mapred-site.xml *****
mapreduce.map.memory.mb=1792
mapreduce.map.java.opts=-Xmx1280m
mapreduce.reduce.memory.mb=3584
mapreduce.reduce.java.opts=-Xmx2560m
mapreduce.task.io.sort.mb=640
***** yarn-site.xml *****
yarn.scheduler.minimum-allocation-mb=1792
yarn.scheduler.maximum-allocation-mb=14336
yarn.nodemanager.resource.memory-mb=14336
yarn.app.mapreduce.am.resource.mb=1792
yarn.app.mapreduce.am.command-opts=-Xmx1280m
***** tez-site.xml *****
tez.am.resource.memory.mb=3584
tez.am.java.opts=-Xmx2560m
***** hive-site.xml *****
hive.tez.container.size=1792
hive.tez.java.opts=-Xmx1280m
hive.auto.convert.join.noconditionaltask.size=402653000
I wanted to apply these recommendations to YARN, but Ambari recommends something else:

yarn.nodemanager.resource.memory-mb=5120
yarn.scheduler.minimum-allocation-mb=512
yarn.scheduler.maximum-allocation-mb=5120
Can anyone explain why Ambari and hdp-configuration-utils give different recommendations? I would be really grateful.
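For reference, this is roughly how I invoked the script (the flag names are taken from the script's documented usage, so please treat them as an assumption if your version differs):

# cores, memory (GB), disks, HBase installed?
python hdp-configuration-utils.py -c 4 -m 15 -d 4 -k False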
10-17-2017
03:05 PM
I saw both of those sites. But my question is: how can I run the stack-advisor.py script? Can I just run python stack-advisor.py, or do I have to configure something first? And where can I find the file with the recommendations?
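From what I can piece together from the Ambari sources, the script takes an action plus two JSON files describing the hosts and services; I have not verified this end to end, so the exact invocation below is an assumption:

cd /var/lib/ambari-server/resources/scripts
python stack_advisor.py recommend-configurations hosts.json services.json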
10-17-2017
09:28 AM
Hi. I have a question regarding the Stack Advisor script. I want to tune my cluster: change the YARN, Tez, and Hive configuration files. I can use the hdp-configuration-utils.py script, but it is too simple. I read that there are stack and service advisors that can give me recommendations. My question is: can I use the stack advisor script after HDP installation, or should I not use it because Ambari only uses it during HDP installation? And if I can use it, how do I do that?
10-06-2017
09:21 AM
Hi. Configuration management tools are really popular these days, and I want to deploy one of them, for example Ansible, for my cluster. Is there a way to point Ansible (or another CM tool) at an existing HDP cluster? I found a couple of articles about Ansible and HDP, but all of them show how to deploy a new HDP cluster using Ansible. What about an existing cluster?
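To make the question more concrete, what I imagine is simply describing the existing nodes in an inventory and running ad-hoc tasks against them; the hostnames below are made up:

# inventory.ini - existing HDP nodes, grouped by role
[masters]
node1.example.com

[workers]
node2.example.com
node3.example.com

# quick connectivity check against the existing cluster
# ansible -i inventory.ini all -m ping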
09-29-2017
09:45 AM
@ywang Thanks for the answer. But the problem is that after I changed the parameter run_as_user=ambari-pkt, I could not start a couple of services, for example Oozie. I noticed that the ambari-pkt user does not have permission to the directory /var/lib/ambari-agent/. The question is: what should I do? Remove ambari-agent and install it again, or is a simple chown enough?

chown -R ambari-pkt /var/lib/ambari-agent
09-29-2017
08:41 AM
Hi. I want to change the ambari-agent user to a non-root account. The documentation says I should install ambari-agent manually. What if I installed it using Ambari (i.e., not manually)? Do I have to remove ambari-agent and reinstall it manually in order to configure it for a non-root user?
09-28-2017
07:40 AM
I have set up HTTPS for Ambari. It works well, but I have a problem with redirecting. When I try the HTTP URL, I see an error that the page is not available. The page works fine with the HTTPS URL. Is there any way to redirect the Ambari HTTP URL to HTTPS?
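Ambari itself does not seem to expose a redirect option, so the workaround I am considering is a small reverse proxy in front of it. A minimal nginx sketch, with a made-up hostname and the default Ambari ports:

# Redirect plain HTTP requests for Ambari to the HTTPS endpoint
server {
    listen 8080;
    server_name ambari.example.com;
    return 301 https://ambari.example.com:8443$request_uri;
}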
09-25-2017
12:50 PM
SmartSense looks great. Is there a free version, or only a paid one?
09-20-2017
02:11 PM
Thanks for the answer. I replaced root with ambari and everything is fine. Can you tell me one more thing? What about the sudo configuration? Is it necessary to change it? I started ambari-server and it works well; there is just one error in the logs:

Unable to check firewall status when starting without root privileges.
Please do not forget to disable or adjust firewall if needed
Ambari database consistency check started...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start..../bin/sh: line 0: ulimit: open files: cannot modify limit: Operation not permitted
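Regarding the ulimit message, my plan is to raise the open-files limit for the new ambari user; a sketch, assuming the standard limits.conf mechanism (the value is just a guess):

# /etc/security/limits.conf - allow the ambari user more open files
ambari  soft  nofile  65536
ambari  hard  nofile  65536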
09-20-2017
12:44 PM
@Geoffrey Shelton Okot I added these two lines to the custom core-site:

hadoop.proxyuser.ambari.groups=*
hadoop.proxyuser.ambari.hosts=*

Do you know if I have to delete the parameters for the root user from the custom core-site? I mean these two lines:

hadoop.proxyuser.root.groups=*
hadoop.proxyuser.root.hosts=*
09-20-2017
11:01 AM
I installed HDP using Ambari as the root user. For security reasons I want to change that. As I read, there is no problem with running the Ambari Agent as a non-root user (How to Configure an Ambari Agent for Non-Root). But what about the Ambari Server? During the Ambari Server setup process, when prompted "Customize user account for ambari-server daemon?", I chose n. Is there any way to change the user for the Ambari Server, or do I have to run the Ambari Server setup again?
09-15-2017
12:03 PM
Hi. I am looking for a script that can help me configure my cluster based on the cluster hardware. I know there is a script, hdp-configuration-utils (hdp-configuration-utils on GitHub), but it is old and I don't know whether it still makes sense to use it. My question is: is there any tool or script that can help configure an HDP cluster? And is hdp-configuration-utils still a good option for this kind of task?
09-15-2017
08:06 AM
@Aravindan Vijayan 1. I see the same values, so that works. 3. But when I added up the sizes of the 3 disks from the first output I got 99880680 KB, while the second one shows 95.25. What is the reason for the difference between them?
09-14-2017
01:16 PM
Hi. I have a question about the Ambari API. I want to run the hdp-configuration-utils script, but I need a few pieces of information: the number of cores, memory, disks, and whether HBase is enabled (I did not install it, so the value is 'False'). My questions:

1. When I run the command GET api/v1/clusters/c1/hosts I get the parameters 'cpu_count' and 'ph_cpu_count'. Which one should I use?
2. How can I check the number of disks?
3. How can I get info about free and total disk size? I found two parameters:

- disk_info

"disk_info" : [
{
"available" : "42331676",
"device" : "/dev/mapper/VolGroup-lv_root",
"used" : "6521952",
"percent" : "14%",
"size" : "51475068",
"type" : "ext4",
"mountpoint" : "/"
},
{
"available" : "423282",
"device" : "/dev/sda1",
"used" : "38770",
"percent" : "9%",
"size" : "487652",
"type" : "ext4",
"mountpoint" : "/boot"
},
{
"available" : "45423700",
"device" : "/dev/mapper/VolGroup-lv_home",
"used" : "53456",
"percent" : "1%",
"size" : "47917960",
"type" : "ext4",
"mountpoint" : "/home"
}
]

- metrics/disk

"disk" : {
"disk_free" : 83.99,
"disk_total" : 95.25,
"read_bytes" : 1.9547998208E10,
"read_count" : 1888751.0,
"read_time" : 2468451.0,
"write_bytes" : 1.5247885312E10,
"write_count" : 2020357.0,
"write_time" : 9.9537697E7
}

Which one should I check when I want to compare it with the official sizing recommendations?
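For reference, the kind of API call I am using to pull these fields in one request looks roughly like this (hostnames and credentials are placeholders):

curl -u admin:admin \
  "http://ambari.example.com:8080/api/v1/clusters/c1/hosts/node1.example.com?fields=Hosts/cpu_count,Hosts/ph_cpu_count,Hosts/total_mem,Hosts/disk_info"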
07-18-2017
10:01 AM
It works! Thank you 🙂
07-17-2017
11:49 AM
1 Kudo
Hi. I have a problem with the Spark 2 interpreter in Zeppelin. I configured the interpreter as shown in the screenshot attached to the original post; the relevant properties are sketched after the query below. When I run a query like this:

%spark2.sql
select var1, count(*) as counter
from database.table_1
group by var1
order by counter desc

the Spark job runs only 3 containers and takes 13 minutes. Does anyone know why the Spark interpreter uses only 4.9% of the queue? How should I configure the interpreter to increase this share?
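These are the interpreter properties I believe control the resource footprint; the values here are illustrative, not what I actually have set:

# Spark 2 interpreter properties in Zeppelin (illustrative values)
spark.executor.instances         10
spark.executor.cores             4
spark.executor.memory            8g
spark.dynamicAllocation.enabled  false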
03-22-2017
08:55 AM
@yvora But the problem is that, because of Zeppelin, the processing time in the q_apr_general queue is longer. This is weird because the processes are in different queues, and YARN should reserve only the resources available for each queue, not more. I set up a max limit but it did not help. Do you have any other ideas?
03-21-2017
04:18 PM
Hi. I've got a problem with YARN and the Capacity Scheduler. I created two queues:

1. default - 60%
2. q_apr_general - 40%

There is one Spark Streaming job in the queue 'q_apr_general'. The processing time for every single batch is ~2-6 seconds. In the default queue I started Zeppelin with preconfigured resources; I added one line to zeppelin-env.sh:

export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.4.2.0-258 -Dspark.executor.instances=75 -Dspark.executor.cores=6 -Dspark.executor.memory=13G"

The problem is that when I execute a Spark SQL query in Zeppelin, the streaming batch processing time grows to ~20-30 seconds. This is weird to me, because the Zeppelin job and the Spark Streaming job are in different queues, so the streaming job should not depend on the Zeppelin job in another queue. Does anyone know what the reason for this could be?
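For context, the capacity settings I have in mind look roughly like this (property names are from the Capacity Scheduler documentation; the maximum-capacity values are what I am considering in order to stop one queue from borrowing everything):

yarn.scheduler.capacity.root.queues=default,q_apr_general
yarn.scheduler.capacity.root.default.capacity=60
yarn.scheduler.capacity.root.default.maximum-capacity=60
yarn.scheduler.capacity.root.q_apr_general.capacity=40
yarn.scheduler.capacity.root.q_apr_general.maximum-capacity=40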
01-20-2017
12:32 PM
You should check the answer to my question: The job is complete, but has status running.
1. You can change the value of the tez.session.am.dag.submit.timeout.secs parameter (as you can see in the link above).
2. You can give more resources to the default queue.
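If it helps, option 1 would be set in tez-site.xml along these lines (the 60-second value is just an example):

<property>
  <name>tez.session.am.dag.submit.timeout.secs</name>
  <value>60</value>
</property>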
12-02-2016
11:49 AM
Thank you for the reply. I updated my post.
11-24-2016
01:48 PM
Hi. I have three Spark Streaming applications. One of them saves data to a Hive table (Parquet format). The other two read data from that table and cache it every hour at the same moment; both use the same code to read the data. Writing to the table and reading from it never happen at the same time. After a couple of hours, one of the readers seems to read only part of the data. You can see below how it looks in the Storage tab: the first application reads all the data, while the second one omitted one partition. Do you know what the reason for this issue is?

UPDATE

Read: the two reading applications share the same piece of code, which caches the table:

sqlContext.clearCache()
df = sqlContext.sql('select timestamp, col2 from table where timestamp > time')
df.cache()

This code is executed every hour.

Write:

df.registerTempTable('temp')
sqlContext.sql("INSERT INTO TABLE table SELECT * FROM temp")

The code above is also executed every hour. I checked, and I am sure that both pieces of code finish properly. Every hour, when I insert new data into the table, Hive creates a new partition for it, so every partition holds data for only one hour. When I read the table I want the data from the last hour (the most recently added partition). The problem is that Spark Streaming seems not to pick up the new partitions in the table, so if the desired data sits in a new partition there is a chance Spark will not read it. Maybe I should do something differently in the way I insert data into Hive? Spark version: 1.6.
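One thing I am going to try is forcing a metadata refresh before each hourly read; a sketch in PySpark, assuming refreshTable is available on HiveContext in Spark 1.6 (the table name is a placeholder):

# Refresh Hive metadata so newly added partitions are visible, then re-cache
sqlContext.refreshTable('table')
sqlContext.clearCache()
df = sqlContext.sql('select timestamp, col2 from table where timestamp > time')
df.cache()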
11-09-2016
02:01 PM
1 Kudo
Hi. I'm trying to install Impala on my cluster. I found two ways to do that:

1. HDP + Impala. There is a problem with two libraries:

Error: Package: impala-shell-2.7.0+cdh5.9.0+0-1.cdh5.9.0.p0.32.el6.x86_64 (cloudera-cdh5)
Requires: libpython2.6.so.1.0()(64bit)
Error: Package: impala-2.7.0+cdh5.9.0+0-1.cdh5.9.0.p0.32.el6.x86_64 (cloudera-cdh5)
Requires: libsasl2.so.2()(64bit)

I don't know where the problem is. I think it might be an OS issue or a difference between HDP and CDH.

2. The official wiki instructions. But, as you can see there, the prerequisite is Ubuntu, and I use CentOS 7.

Does anyone know an alternative way to install Impala? My cluster: HDP 2.4, CentOS 7.