Member since: 11-03-2015
Posts: 32
Kudos Received: 0
Solutions: 1

My Accepted Solutions

Title | Views | Posted |
---|---|---|
| 10569 | 11-30-2015 03:20 AM |
07-18-2018
02:56 AM
Your point is well taken; I think the issue here (at least on my side) is that the Workbench (which I tested in a bootcamp run by Cloudera a year ago) is pretty good, but it isn't cheap either. For labs, development and all that stuff it is not affordable for a small company. In my case, my company (a consultancy) needs to be able to develop a new product or service that makes use of ML techniques and would best be developed in a "shared notebook" fashion. The result would probably be sold to the customer together with the Workbench, but of course we need to develop it first, with no guarantee of success. Although we are Cloudera resellers, there's no guarantee the customer also wants to buy the CDSW license (maybe a "developer license" would cover this gap). That's why we need to switch to inexpensive software like Zeppelin and Livy to get the job done, at least in the alpha stage. This is my point of view. Take care, O.
07-18-2018
01:26 AM
OK, I understand your point, but what if the mappers are failing? YARN already launches as many mappers as there are input files; should I increase this further? Since only a minority of my jobs are failing, how can I tune YARN to use more mappers for these particular jobs?
05-02-2018
03:39 AM
Hi, getting back to this old topic to gather more answers on the subject. I have errors with mappers and reducers falling short on memory. Of course increasing the memory fixes the issue, but as already mentioned I am wasting memory on jobs that don't need it. Plus, I was thinking that this stuff was made to scale, so it would handle a particularly big job just by splitting it. In other words, I don't want to change memory values every time a new application fails due to memory limits. What is the best practice in this case? Thanks, O.
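PS: to be concrete, what I end up doing today is a per-job override on the command line instead of touching the cluster-wide defaults; a rough sketch (jar name, driver class and values are just placeholders):

```
# Per-job memory overrides (illustrative values only), instead of raising the
# cluster-wide mapreduce defaults for everyone.
# Assumes the driver parses generic -D options via ToolRunner.
hadoop jar my-job.jar com.example.MyDriver \
  -Dmapreduce.map.memory.mb=4096 \
  -Dmapreduce.map.java.opts=-Xmx3276m \
  -Dmapreduce.reduce.memory.mb=6144 \
  -Dmapreduce.reduce.java.opts=-Xmx4915m \
  /input/path /output/path
```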
07-21-2017
07:37 AM
Please disregard my previous message. While the aforementioned error IS showing, it is not a blocking issue; my problems were caused by the firewall. Bye
07-21-2017
02:38 AM
Hey, just to point out that this issue arises also when following path B. Steps to reproduce (CentOS 7.3, Manager version 5.12.0-1):
1. Manually install the JDK on the nodes
2. Grab the cloudera-manager.repo file
3. Install via yum: yum install cloudera-manager-daemons cloudera-manager-server
4. Change db.properties to point to the external MySQL databases
5. Start Cloudera Manager: systemctl start cloudera-scm-server
It then hangs with this error: ERROR ParcelUpdateService:com.cloudera.parcel.components.ParcelDownloaderImpl: Failed to download manifest. Status code: 404 URI: https://www.cloudera.com/downloads/manifest.json
So it is automatically pointing to this location; I haven't changed the parcels location yet. AFAIK you must change this default location, otherwise it will not work. Question is: how do I do that?
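For clarity, steps 3 to 5 on one node look roughly like this (standard 5.x paths, adjust to your environment):

```
# Install the Cloudera Manager server packages from the 5.12 repo
sudo yum install -y cloudera-manager-daemons cloudera-manager-server

# Point the server at the external MySQL databases
sudo vi /etc/cloudera-scm-server/db.properties

# Start the server and watch the log, where the 404 on manifest.json shows up
sudo systemctl start cloudera-scm-server
sudo tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log
```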
06-20-2017
09:04 AM
Hi, I need to sqoop about 700 tables from 2 Oracle instances and I am using a custom query to extract them. To accelerate the process a bit more, I set --fetch-size 2000000 in Sqoop. I have a file with a table on every line, plus some arguments and the query. I built a shell script that uses GNU Parallel to run more than one offload at the same time. It works correctly; however, I don't understand why I need to tune the heap size of the processes, otherwise they fail with OOM. I understand that Sqoop uses the HDFS client to write data to HDFS, and since I force Sqoop to fetch 2 million records at a time, I need to tune the process to have room for them all. So I tune the HDFS client via HADOOP_CLIENT_OPTS="-Xmx6144m $HADOOP_CLIENT_OPTS" inside the script, and the Sqoop mappers' heap size via -Dmapreduce.map.memory.mb=8192 and -Dmapreduce.map.java.opts=-Xmx6553m in the sqoop import command. My point is: why do some tables complete and others don't? Why can't it just slow down to keep pace? I don't like this approach because as soon as a table grows larger, Sqoop will fail. I can't go to production with something that I know in advance will break in the future.
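For reference, each parallel invocation looks roughly like this (connection string, query and target directory are placeholders; the memory values are the ones mentioned above):

```
# One Sqoop offload as launched by the GNU Parallel wrapper
export HADOOP_CLIENT_OPTS="-Xmx6144m $HADOOP_CLIENT_OPTS"

sqoop import \
  -Dmapreduce.map.memory.mb=8192 \
  -Dmapreduce.map.java.opts=-Xmx6553m \
  --connect jdbc:oracle:thin:@//orahost:1521/ORCL \
  --username myuser --password-file /user/myuser/.ora_pass \
  --query 'SELECT ... FROM MY_TABLE WHERE $CONDITIONS' \
  --fetch-size 2000000 \
  --target-dir /staging/MY_TABLE \
  --num-mappers 1
```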
06-20-2017
08:25 AM
Hi, I am importing from Oracle using Sqoop1 (version 1.4.6, CDH 5.7.4). I cannot use Sqoop to write directly to the destination Hive table, since this table has a custom type mapping that is different from the one used by Sqoop. To achieve this, I offload data to a temporary uncompressed Parquet table, and then I load the destination table using Hive (beeline). This way I can compress it (Snappy) and fix the datatypes on the fly. This works correctly. The problem is that I have some tables where one or two fields contain special chars that break my table. I know this because, after a lot of debugging, I have a working solution that uses Oracle's replace function to replace newlines, tabs and carriage returns with a space (' '). So, for the fields I know to be problematic, I write a query that replaces those chars upon extraction, and it works fine. However, this is clearly the wrong approach, since I may have other fields with dirty chars, and also other dirty chars apart from the three I am replacing. Also, I moved from flat files to a binary format (Parquet) for this very reason: to not have to bother with special chars. Isn't Parquet supposed not to care about the data contained in the fields? What am I missing? What can I do to fix this? For now, Avro is not an option (SAS is not compatible with it). Thanks
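PS: the workaround currently in place looks roughly like this (table and column names are placeholders):

```
# Strip newline/CR/tab on the Oracle side during extraction, then land the data
# as uncompressed Parquet in a staging directory.
sqoop import \
  --connect jdbc:oracle:thin:@//orahost:1521/ORCL \
  --username myuser --password-file /user/myuser/.ora_pass \
  --query "SELECT ID,
                  REPLACE(REPLACE(REPLACE(NOTES, CHR(10), ' '), CHR(13), ' '), CHR(9), ' ') AS NOTES
           FROM MY_TABLE WHERE \$CONDITIONS" \
  --as-parquetfile \
  --target-dir /staging/my_table_tmp \
  --num-mappers 1
```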
05-30-2017
02:57 AM
Hi, I'm replying to this question to ask whether dynamic service discovery is supported on CDH 5.7.6. I successfully set up 2 HS2 instances behind an external load balancer, but I'm worried about concurrency. Please provide some news, or at least tell me if there's no problem without it. Thanks
05-02-2017
11:20 AM
Hi, did you manage to resolve this? I've got the same error.
04-06-2017
12:26 AM
Kudos to you for pointing this out. Fortunately the customers where I'd like to introduce this don't use Oozie, but it certainly is a problem. What sounds really strange to me is: are we the only 3 people who have tried to figure this out? Is everyone else just using a single DB instance, or are they all using Oracle?
01-26-2017
06:38 AM
Thanks, I have the same problem. Could anyone explain whether it is correct to manually install new packages on every node? Is this a feature you acquire with commercial Anaconda only? Thanks
07-22-2016
01:59 AM
Hello, is there a way to audit user logins in Cloudera Manager? We'd like to audit both successful and failed logins. Is there such a metric in CM? Would it be possible to retrieve them via syslog or something? Thanks, bye Omar. EDIT: I found that CM also writes to syslog, so I'm OK with it.
07-17-2016
02:52 AM
Hello, I have this question too and, if you don't mind, I'd like to add some other considerations. I see that CDH services usually declare compatibility with Oracle, MySQL and Postgres. However, not all of them support all three (Hue for instance), and looking closely only MySQL seems to be compatible across every service. So I think that for now the best bet is MySQL (I don't want Oracle, anyway).

I am doing some research on a backend DB supporting HA. In my quest I found two solutions that provide HA for MySQL: Percona XtraDB Cluster and MariaDB Galera, where the first actually uses libraries from the latter and adds some other interesting things. My question is: what is Cloudera's position regarding backend DBs in HA? Let me say that there is not great support for this in the documentation: there are guides for making HS2 and HMS read from an HA DB, but not many considerations or best practices.

My ultimate goal is to truly make HMS and HS2 highly available, adding an HA backend DB with a load balancer on top of it, so I can: load-balance accesses; obtain a Metastore in true HA; and migrate other services such as Cloudera Manager, Hive, Impala, etc. to a real always-on state, thus giving me the option to "hot-swap" services that are failing (for example, making Hive respond even if one of the servers crashes). I know that Cloudera would probably not stand for one of them over the other, but I'd like to have some recommendations (maybe they are already Cloudera partners), or to know whether some tests have been done in the past. I am interested in Percona: while Galera is in an alpha state (though they say it is reliable), Percona offers support and reports some companies already using it in production environments. I am also interested in paying for support. Looking forward to your reply, thanks Omar
06-20-2016
02:41 AM
Hi, I wasn't notified of your reply. I had the opportunity to talk to someone inside Cloudera, and he said that upon principal creation CM does not append any specific option for password length, because it just asks Kerberos to generate a random password; so it only adds -randkey. He said that any modification to the length of a random password should be done on the Kerberos side, which actually makes sense. Question is: how do I tell Kerberos or AD the default size of the randomly generated password? I can't find anything about it. I wasn't able to find anything like a policy or defaults for this: there are specific options for minsize and maxsize, though, but you have to append them when requesting the principal. Bye O.
06-06-2016
05:14 AM
Nope, still waiting for a reply here. By the way, how did you find out the password is 12 characters long?
05-30-2016
12:41 AM
Hi, I need to kerberize the cluster using Active Directory. I want Cloudera Manager to manage all the principals for me, so I need to create a principal for it in AD so that it can create the other principals needed. So far so good, but the company requires a specific password length. So the question is: how can I tell Cloudera Manager the password length of the principals to be created in Active Directory? What is the default configuration? I found on this page that using /minpass and /maxpass you can set the length of the randomly generated password: how can I pass something like this to CM? Thanks, bye Omar
05-10-2016
01:22 AM
Hello, I have created a dashboard in Hue with the twitter-demo collection on Cloudera Search. I am experimenting to see if I can segregate access to collections per user name. I am able to create dashboards, and in fact I see that Hue proxies the user to Solr, but in Hue I can access all the dashboards I create. Is it possible to limit access to users, based on their username or access level? I want to find out if Hue+Search can be used for self-service BI, but I need to be able to differentiate access levels. Thanks, bye Omar
01-22-2016
06:18 AM
Hi, thanks for your reply, but my question was precise. For instance:
- show role grant group <groupname> shows me the roles assigned to the group
- show grant role <rolename> shows me the objects managed by the role
I can't find a single command that, given a group, shows what it is able to see — something like show roles on group <groupname>. This way it would be much simpler, given a group, to know what that group can access. Hope this is clearer now. Bye Omar
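PS: in other words, today the only way I see is to chain the two commands manually, more or less like this (connection string, group and role names are placeholders):

```
# Step 1: list the roles granted to the group
beeline -u "jdbc:hive2://hs2-host:10000/default;principal=hive/_HOST@EXAMPLE.COM" \
        -e "SHOW ROLE GRANT GROUP mygroup;"

# Step 2: for each role returned, list the objects it can access
beeline -u "jdbc:hive2://hs2-host:10000/default;principal=hive/_HOST@EXAMPLE.COM" \
        -e "SHOW GRANT ROLE analyst_role;"
```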
01-15-2016
03:27 AM
Hello, on Hive with Sentry, how can I see which users or groups are assigned to a role? I am able to see which DBs a role impacts, but not which users/groups that role is assigned to. Bye Omar
11-30-2015
07:44 AM
Thank you very much. Just a question: which of those would you prefer? I suppose that the tmp folder would be flushed upon system reboot, right?
11-30-2015
03:20 AM
I resolved the problem on my own; I just want to point out that this strange behaviour was due to some incorrectness in the data. At some point in time, the partitioned data went from "table_folder/one_partition/another_partition" to "table_folder/another_partition/one_partition". This caused the msck repair command to fail, aligning the metastore data only to the latter partition layout. At the moment I don't know what caused the inversion; I asked the dev team and they don't know either. Anyway, fixing this problem (by recreating the table with the partitions in the correct order) let msck repair work correctly. Bye Omar
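PS: for anyone hitting the same thing, the fix was basically this (table, columns and paths are placeholders; the point is that the PARTITIONED BY column order has to match the directory layout on HDFS):

```
# Recreate the table with the partition columns in the same order as the
# directories under the table folder, then re-run msck repair.
beeline -u "$JDBC_URL" -e "CREATE EXTERNAL TABLE my_table (id BIGINT, payload STRING)
  PARTITIONED BY (one_partition STRING, another_partition STRING)
  STORED AS PARQUET
  LOCATION '/user/hive/warehouse/mydb.db/table_folder'"

beeline -u "$JDBC_URL" -e "MSCK REPAIR TABLE my_table"
```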
11-30-2015
03:14 AM
Hello, do you have any update? There's a planned restart of this cluster today and I'd like to apply the configuration changes if needed. Thanks, O.
11-27-2015
07:36 AM
Hi, I have a CDH 5.4.4 cluster and I can't run Hive queries from Hue; it throws this error: Fetching results ran into the following error(s): Couldn't find log associated with operation handle: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=31e3ba21-952b-4cdf-8688-c8716fcb430d] Searching around I found this: http://hortonworks.com/community/forums/topic/hive-show-database-error/ How can I implement it in Cloudera Manager? Is it the correct solution? Thanks, bye Omar
11-10-2015
05:33 AM
Hello, I am experimenting with Hue, a tool I had long underestimated, and I find it really interesting. Keep up the great work, guys. I'd like to file a feature request: in the Security app, please add a feature to list the full set of Sentry permissions, and even a way to "bulk export" it. Let me explain. When using sentry-provider.ini, we used to keep it in a git repo. This made it very fast for our group (Hadoop admins) to retrieve any version we needed, from wherever we were. It was also useful for having an overview of the whole situation. Furthermore, we were able to add or edit permissions at a glance, just by uploading a new file. With the Sentry Service, which is by the way more powerful, this is no longer possible. Listing grants via the CLI is not that quick, and listing them via Hue is nicer but still does not provide an overview. A tool capable of showing an overview (maybe as a tree view?) would be very much appreciated. I know that for backup purposes I can simply dump the Sentry DB, but implementing this kind of view would probably lead to a simple backup strategy: a special file could be exported and imported. What do you think? My client, which holds 4 licences, would love it.
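For completeness, the "dump the Sentry DB" fallback I mention is nothing more than something like this, assuming the Sentry store lives in MySQL (database name and user are placeholders):

```
# Plain dump of the Sentry backend database as a crude backup
mysqldump -u sentry -p --single-transaction sentry > sentry_backup_$(date +%F).sql
```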
11-07-2015
12:35 AM
This is a good answer! I work for a paying Cloudera customer, and I am experimenting with new features to propose future enhancements. I was experimenting with HA for HS2, and was choosing between separate HS2 servers (externally balanced via HAProxy or NetScaler) and this approach. OK, separate HS2 servers are not proper HA, but sort of. Since this approach sounded better to me, simply because it uses ZooKeeper and not an external piece of software, I was searching for info. Anyway, do you have any indication of which CDH version will support this? Would it be a problem in any way with Metastore HA + the Sentry service + the CM ACL plugin, which is also unsupported? I asked support and they offered to send me the local support agent: I don't want support for implementing it, I just want to know whether it makes sense for me to investigate it, given that it will be supported soon. This way I can tell my client that the experimentation makes sense; hope you understand what I mean. Bye Omar
11-05-2015
08:19 AM
Hi, I tried this on CDH 5.4.4 and it works. Actually, I just tried the JDBC string providing the ZooKeeper ensemble; I had configured the CM property in advance and it just worked. IMHO this property is very useful, and I don't know why Cloudera is not supporting it.
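The connection string I tested is roughly this one (hosts are placeholders; the namespace must match the one configured for HS2):

```
# JDBC URL using the ZooKeeper ensemble instead of a fixed HS2 host
beeline -u "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"
```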
11-05-2015
08:12 AM
Hello, my client is asking me for a way to back up Hive tables to tape. I know, this is not "big-data style", but it is mandatory for them, so I need to accommodate it. I found a way to do this, but the restore implies this procedure:
- create the table using the DDL previously backed up via the "show create table" statement;
- mv the files into the warehouse dir/db/table just created;
- run msck repair table on that table.
The commands work without errors; however, I found out that the original table has about 111 million records, while the target has only 37 million. I compared the HDFS size of the folders and they are the same. I compared the number of partitions of the tables and they are the same. I tried to run msck repair once again (just in case), but the result doesn't change. So I think the problem must be in the msck command: the files are in place, but somehow it skips some of them while fixing. What do you think? Bye Omar
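PS: concretely, the restore steps look like this (paths and names are placeholders):

```
# 1. Recreate the table from the DDL saved earlier with SHOW CREATE TABLE
beeline -u "$JDBC_URL" -f my_table_backup.ddl

# 2. Move the restored files under the new table's warehouse directory
hdfs dfs -mv /restore_area/my_table/* /user/hive/warehouse/mydb.db/my_table/

# 3. Rebuild the partition metadata
beeline -u "$JDBC_URL" -e "MSCK REPAIR TABLE mydb.my_table"
```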
11-04-2015
12:52 AM
Thanks, I will let you know. Bye
11-04-2015
12:08 AM
Yes, I guessed that. So, just to be super safe: according to the last example, if it finds an ojdbc6.jar in both paths, it will load the first one and discard the latter, right?