Member since: 03-30-2017
Posts: 8
Kudos Received: 1
Solutions: 0
06-09-2017
11:52 AM
Thanks for your response. If I partition the data by the yyyy-mm-dd field and I receive only one file per day, I assume I will always have one file per partition, irrespective of this setting? If that assumption is correct (please correct me if it is wrong), will I end up with SELECT queries that run slower if I store files for, say, 6 years? That is, I will have 6 * 365 files, each around 8-9 MB in size (smaller than the default HDFS block size). I was hoping to consolidate the files on a weekly basis, but I need the data to be available to users daily, so I don't think I can do that. Let me know your suggestions. Thanks, Nikkie
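[Editorial sketch, not confirmed in this thread] Purely as an illustration of what "weekly consolidation" could look like: if the ORC table were partitioned by week rather than by day, each daily load would just add one more small file to the current week's partition, and a weekly compaction pass could merge that partition's files in place without taking data away from users. All table, column and partition names below are assumptions, as is the JDBC URL.

# Hypothetical weekly compaction pass, assuming an ORC table named events_orc
# partitioned by an illustrative load_week column (names are not from the thread).
# CONCATENATE merges the small ORC files inside one partition in place, so the
# data stays queryable while it runs.
beeline -u "jdbc:hive2://localhost:10000/default" -e "
  ALTER TABLE events_orc PARTITION (load_week = '2017-23') CONCATENATE;
"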
06-09-2017
05:33 AM
1 Kudo
Hi, I have large CSV files which arrive in Hadoop on a daily basis (about 10 GB, one file per day).
I have a Hive external table pointed at the files (no partitions / no ORC) - Table1. I have another table, Table2 (external table + ORC-ZLIB), partitioned by date (yyyy-mm-dd) and loaded from Table1 using
insert into Table2 partition(columnname) select * from Table1 with hive.exec.dynamic.partition = true enabled.
The daily files, once compressed via ORC, come to <10 MB (the compression ratio was a surprise to me).
I have read about the multiple-small-files problem in Hadoop from the HW community. Are there any additional settings in Hive, or other considerations, that should be in place so that
we don't run into performance issues caused by the multiple small files? Thanks, Nikkie
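[Editorial sketch] For readers following along, here is a minimal sketch of the daily load flow described above, run through beeline. The table and column names (raw_events, events_orc, load_date) and the JDBC URL are illustrative assumptions, not the actual names from this post.

# Sketch of the described pipeline: external CSV staging table -> partitioned ORC table.
# All object names and the connection string are placeholders.
beeline -u "jdbc:hive2://localhost:10000/default" -e "
  SET hive.exec.dynamic.partition = true;
  SET hive.exec.dynamic.partition.mode = nonstrict;  -- needed when every partition value is dynamic

  -- raw_events: external table over the daily 10 GB CSV drop (no partitions, no ORC).
  -- events_orc: external table stored as ORC with ZLIB, partitioned by load_date.
  -- With dynamic partitioning, the partition column must be the last column produced by the SELECT.
  INSERT INTO TABLE events_orc PARTITION (load_date)
  SELECT * FROM raw_events;
"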
Labels:
Apache Hive
05-23-2017
12:30 PM
Hi, I think I figured it out myself. Here is what I did:
curl -u admin:admin -X GET http://127.0.0.1:6080/service/public/v2/api/servicedef/name/hive >> test.out
Opened the test.out file. The mistake I made previously was that I sent only the policyConditions in the file; now the full output from the GET, with the updated policyConditions, is sent back. I updated the policyConditions section, which was blank in the GET response, as below:
{ "id": 3, "guid": "3e1afb5a-184a-4e82-9d9c-87a5cacc243c", "isEnabled": true, "createTime": 1477381370000, "updateTime": 1477381412000, "version": 2, "name": "hive", "implClass": "org.apache.ranger.services.hive.RangerServiceHive", "label": "Hive Server2", "description": "Hive Server2", "options": {},
"configs": [ { "itemId": 1, "name": "username", "type": "string", "mandatory": true, "validationRegEx": "", "validationMessage": "", "uiHint": "", "label": "Username" }, { "itemId": 2, "name": "password", "type": "password", "mandatory": true, "validationRegEx": "", "validationMessage": "", "uiHint": "", "label": "Password" }, { "itemId": 3, "name": "jdbc.driverClassName", "type": "string", "mandatory": true, "defaultValue": "org.apache.hive.jdbc.HiveDriver", "validationRegEx": "", "validationMessage": "", "uiHint": "" }, { "itemId": 4, "name": "jdbc.url", "type": "string", "mandatory": true, "defaultValue": "", "validationRegEx": "", "validationMessage": "", "uiHint": "" }, { "itemId": 5, "name": "commonNameForCertificate", "type": "string", "mandatory": false, "validationRegEx": "", "validationMessage": "", "uiHint": "", "label": "Common Name for Certificate" } ],
"resources": [ { "itemId": 1, "name": "database", "type": "string", "level": 10, "mandatory": true, "lookupSupported": true, "recursiveSupported": false, "excludesSupported": true, "matcher": "org.apache.ranger.plugin.resourcematcher.RangerDefaultResourceMatcher", "matcherOptions": { "wildCard": "true", "ignoreCase": "true" }, "validationRegEx": "", "validationMessage": "", "uiHint": "", "label": "Hive Database", "description": "Hive Database" }, { "itemId": 2, "name": "table", "type": "string", "level": 20, "parent": "database", "mandatory": true, "lookupSupported": true, "recursiveSupported": false, "excludesSupported": true, "matcher": "org.apache.ranger.plugin.resourcematcher.RangerDefaultResourceMatcher", "matcherOptions": { "wildCard": "true", "ignoreCase": "true" }, "validationRegEx": "", "validationMessage": "", "uiHint": "", "label": "Hive Table", "description": "Hive Table" }, { "itemId": 3, "name": "udf", "type": "string", "level": 20, "parent": "database", "mandatory": true, "lookupSupported": true, "recursiveSupported": false, "excludesSupported": true, "matcher": "org.apache.ranger.plugin.resourcematcher.RangerDefaultResourceMatcher", "matcherOptions": { "wildCard": "true", "ignoreCase": "true" }, "validationRegEx": "", "validationMessage": "", "uiHint": "", "label": "Hive UDF", "description": "Hive UDF" }, { "itemId": 4, "name": "column", "type": "string", "level": 30, "parent": "table", "mandatory": true, "lookupSupported": true, "recursiveSupported": false, "excludesSupported": true, "matcher": "org.apache.ranger.plugin.resourcematcher.RangerDefaultResourceMatcher", "matcherOptions": { "wildCard": "true", "ignoreCase": "true" }, "validationRegEx": "", "validationMessage": "", "uiHint": "", "label": "Hive Column", "description": "Hive Column" } ],
"accessTypes": [ { "itemId": 1, "name": "select", "label": "select", "impliedGrants": [] }, { "itemId": 2, "name": "update", "label": "update", "impliedGrants": [] }, { "itemId": 3, "name": "create", "label": "Create", "impliedGrants": [] }, { "itemId": 4, "name": "drop", "label": "Drop", "impliedGrants": [] }, { "itemId": 5, "name": "alter", "label": "Alter", "impliedGrants": [] }, { "itemId": 6, "name": "index", "label": "Index", "impliedGrants": [] }, { "itemId": 7, "name": "lock", "label": "Lock", "impliedGrants": [] }, { "itemId": 8, "name": "all", "label": "All", "impliedGrants": [ "select", "update", "create", "drop", "alter", "index", "lock" ] } ],
"policyConditions": [ { "itemId": 1, "name": "resources-accessed-together", "evaluator": "org.apache.ranger.plugin.conditionevaluator.RangerHiveResourcesAccessedTogetherCondition", "evaluatorOptions": {}, "label": "Resources Accessed Together?", "description": "Resources Accessed Together?" } ],
"contextEnrichers": [],
<Deleted the remaining> ] } }
Saved the file as hiveService3.json and executed the below command:
curl -v -H 'Content-Type: application/json' -u admin:admin -X PUT --data @hiveService3.json http://127.0.0.1:6080/service/public/v2/api/servicedef/name/hive
* About to connect() to 127.0.0.1 port 6080 (#0)
* Trying 127.0.0.1... connected
* Connected to 127.0.0.1 (127.0.0.1) port 6080 (#0)
* Server auth using Basic with user 'admin'
> PUT /service/public/v2/api/servicedef/name/hive HTTP/1.1
> Authorization: Basic YWRtaW46YWRtaW4=
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.21 Basic ECC zlib/1.2.3 libidn/1.18 libssh2/1.4.2
> Host: 127.0.0.1:6080
> Accept: */*
> Content-Type: application/json
> Content-Length: 10161
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
< HTTP/1.1 200 OK
< Server: Apache-Coyote/1.1
< Set-Cookie: RANGERADMINSESSIONID=E0EA0005D86487C03AB4A3C2129E3A97; Path=/; HttpOnly
< X-Frame-Options: DENY
< Content-Type: application/json
< Transfer-Encoding: chunked
< Date: Tue, 23 May 2017 12:05:34 GMT
<
* Connection #0 to host 127.0.0.1 left intact
* Closing connection #0
It seems to have succeeded: when I clicked Add Condition in Ranger, the "Resources Accessed Together?" condition appeared. I still have to proceed with the next steps. I will let you know. Thanks, Nikkie
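[Editorial recap] In short, the working round trip was: pull the full service definition, fill in the empty policyConditions section while leaving everything else intact, and PUT the complete definition back. A condensed recap of the commands used in this post (same host, credentials and file names as above):

# 1. Fetch the complete Hive service definition from Ranger admin.
curl -u admin:admin -X GET \
  http://127.0.0.1:6080/service/public/v2/api/servicedef/name/hive >> test.out

# 2. Edit the output: fill in the empty "policyConditions": [] section, keep the
#    rest of the definition unchanged, and save it as hiveService3.json.

# 3. PUT the full, edited definition back.
curl -v -H 'Content-Type: application/json' -u admin:admin -X PUT \
  --data @hiveService3.json \
  http://127.0.0.1:6080/service/public/v2/api/servicedef/name/hive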
05-23-2017
11:32 AM
Hi, I was trying to work through the example in the HDP 2.5 sandbox. I created a file named hiveService2.json with the following content:
"policyConditions": [
{
"itemId": 1,
"name": "resources-accessed-together",
"evaluator": "org.apache.ranger.plugin.conditionevaluator.RangerHiveResourcesAccessedTogetherCondition",
"evaluatorOptions": {},
"label": "Resources Accessed Together?",
"description": "Resources Accessed Together?"
}
]
Connected to the sandbox via PuTTY. The file is placed at /root/hiveService2.json. Executed the following command from /root:
curl -v -H 'Content-Type: application/json' -u admin:admin -X PUT --data @hiveService2.json http://127.0.0.1:6080/service/public/v2/api/servicedef/name/hive
I am getting the output below:
* About to connect() to 127.0.0.1 port 6080 (#0)
* Trying 127.0.0.1... connected
* Connected to 127.0.0.1 (127.0.0.1) port 6080 (#0)
* Server auth using Basic with user 'admin'
> PUT /service/public/v2/api/servicedef/name/hive HTTP/1.1
> Authorization: Basic YWRtaW46YWRtaW4=
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.21 Basic ECC zlib/1.2.3 libidn/1.18 libssh2/1.4.2
> Host: 127.0.0.1:6080
> Accept: */*
> Content-Type: application/json
> Content-Length: 325
>
< HTTP/1.1 404 Not Found
< Server: Apache-Coyote/1.1
< Set-Cookie: RANGERADMINSESSIONID=4ABEDBD3646557C69F985A11BF7DDE19; Path=/; HttpOnly
< X-Frame-Options: DENY
< Content-Length: 0
< Date: Tue, 23 May 2017 11:30:40 GMT
<
* Connection #0 to host 127.0.0.1 left intact
* Closing connection #0
Could you please help me 1) correct the content of the file (hiveService2.json), if it is not correct, and 2) get around the 404 Not Found? Thanks, Nikkie
03-30-2017
06:40 AM
OK, so my understanding was correct: we wouldn't need R installed on all the cluster nodes; only the client node with RStudio Server needs it (and it will definitely have it)?
03-30-2017
06:15 AM
@Divakar Annapureddy Thanks for the very quick response. I will try that out. Do you have any insights into question 1 in the original post?
03-30-2017
05:54 AM
@Divakar Annapureddy Thanks for your response. We are using the licensed version, RStudio Server Pro. Can you shed some light on the questions posted?
03-30-2017
05:40 AM
Hi, I am a beginner in this area, so the question might be very basic. We have an HDP 2.4 cluster with Kerberos enabled. We are planning to set up a client node outside our cluster with RStudio Server installed on it. The objective is to use R in combination with Spark (in the cluster): the analyst writes R code in RStudio and uses libraries like SparkR or sparklyr to connect to the HDP cluster. The Hortonworks documentation lists having R installed on all nodes as a prerequisite. The confusing parts for me are:
1) I was under the impression that R is required on all nodes in my cluster if you plan to use SparkR from the HDP cluster nodes (./bin/sparkR). Is that also the case when we use RStudio from a client node?
2) Assume we use the below to connect from RStudio to the cluster:
sc <- spark_connect(master = "spark://IPaddress:port" ...)
How do I authenticate to my Kerberized cluster from RStudio code? Thanks, NThomas
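[Editorial sketch, not confirmed in this thread] On question 2, one common pattern is to obtain a Kerberos ticket for the OS user that runs the R session before spark_connect is called, for example from the client node's shell. The keytab path, principal and realm below are placeholders, not values from this post.

# Hypothetical example: acquire a ticket for the analyst user before starting the
# Spark connection from RStudio. Keytab path and principal are illustrative only.
kinit -kt /etc/security/keytabs/analyst.keytab analyst@EXAMPLE.COM

# Confirm the ticket cache is populated.
klist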
Labels:
Apache Spark