Member since: 12-21-2017
67 Posts
3 Kudos Received
2 Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1067 | 10-15-2018 10:01 AM
 | 4040 | 03-26-2018 08:23 AM
10-10-2018
05:57 AM
@Aditya Sirna So by default, up to 1000 lines of results are stored on HDFS for each query? If I increase the limit, will there be any negative effects, such as slow HTTP transfers or failures when receiving the results?
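For my own understanding, a rough illustration of where I assume the cost of a larger limit would go (this is not Zeppelin's actual code, just a sketch): the extra rows mostly mean more driver memory and a bigger payload to ship to the browser.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("result-limit-sketch").getOrCreate()
df = spark.range(1_000_000).toDF("id")   # stand-in for a real query result

max_result = 1000                         # the assumed display limit
rows = df.limit(max_result).collect()     # only this many rows are pulled to the driver
payload = [r.asDict() for r in rows]      # roughly what would be serialized and sent over HTTP
# Raising max_result grows both `rows` (driver memory) and `payload` (HTTP transfer size),
# which matches the slow-transfer / failed-receive concern above.
```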
10-10-2018
04:49 AM
@Aditya Sirna Thanks, Aditya. What about paging? Since the whole result is saved on HDFS in JSON format, if I need to load only part of the result, do I just load the whole JSON file and cut out the requested part in memory based on the given page size and page number? In practice, does Zeppelin run into out-of-memory problems if the result is very large?
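To make sure I understand, a minimal sketch of the paging approach I have in mind (the result path and page parameters are made up, and I assume the stored result is plain JSON that Spark can read):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("result-paging-sketch").getOrCreate()

def read_page(result_path, page_size, page_number):
    """Load the stored JSON result and return only the requested page."""
    df = spark.read.json(result_path)   # the whole result file is parsed here
    end = page_size * (page_number + 1)
    # Collect only up to the end of the requested page and slice off the earlier pages;
    # collecting the entire DataFrame at once is what would risk OOM for huge results.
    # (A real system would also need a stable ordering so pages stay consistent.)
    rows = df.limit(end).collect()
    return rows[end - page_size:end]

page = read_page("hdfs:///tmp/notebook-result.json", page_size=100, page_number=3)
```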
10-10-2018
03:52 AM
I am designing an HDFS query system based on Spark that includes a paging function, and Zeppelin seems like a good reference for it. Now I have a question: I see that Spark and Spark SQL query results still exist even after I refresh or reopen the notebook, so the results must be saved somewhere. Where is this result data stored? If it is saved in a database, what happens when the result data is so large that it causes database performance problems?
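For context, the persistence pattern I am considering for my own system looks roughly like this (the output path is a placeholder and the DataFrame stands in for a real query result): keep the bulky result on HDFS, so a database would only ever hold small metadata about it.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-result-sketch").getOrCreate()

# Stand-in for an actual query result.
result = spark.range(10_000).toDF("id")

# Write the full result to HDFS as JSON. A database would only need to keep
# metadata (path, schema, row count), so huge result sets never become a
# database performance problem.
result.write.mode("overwrite").json("hdfs:///query-results/job-0001")
```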
Labels:
- Apache Spark
- Apache Zeppelin
08-16-2018
07:15 AM
Hi @Jonathan Sneep Fine, thanks. I have added the user and group info on my NameNode. So the typical way to add a new user or group is to create the user and group on the NameNode host and wait for usersync to sync the user info to Ranger? And if I don't care about group policies, creating an internal user in Ranger and specifying it directly in the allow conditions also works? At least it seems to work in practice.
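For reference, a quick way to double-check this from the NameNode host is to ask HDFS which groups it actually resolves for the user, since those are the groups that group-based policies are evaluated against. A small sketch (just the hdfs CLI wrapped in Python; the user name is the one from my policy):

```python
import subprocess

# Ask HDFS which groups it resolves for the user (run on a cluster host with the hdfs CLI).
# A group-based Ranger HDFS policy only takes effect if the group shows up here.
result = subprocess.run(["hdfs", "groups", "test01"], capture_output=True, text=True)
print(result.stdout.strip())   # expected output along the lines of: test01 : test_group01
```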
08-16-2018
01:31 AM
Hi @Jonathan Sneep Not yet. So I need to add the user and the related group on my NameNode host manually?
08-15-2018
09:39 AM
Hi @Jonathan Sneep Thanks for your response. Actually, both the user and the group were created in Ranger, so they are internal to Ranger.
08-15-2018
07:55 AM
I am having a problem with Ranger authorization. Here are the steps to reproduce it:
1. I create an account in Ranger with the username test01.
2. I set it to belong to a group test_group01.
3. In the Ranger HDFS policy, I give test_group01 access to the directory /data/.
If this worked as expected, the test01 user should have access to /data/ through the privilege inherited from the group test_group01. But in practice it cannot access the directory /data. However, if I specify test01 directly under 'Select User', it works well. So it seems that specifying the group in the policy doesn't take effect, while specifying the permitted user does. How can I solve this? Thanks!
Labels:
- Apache Ranger
07-12-2018
08:13 AM
Thanks Jay. I checked the curl and libcurl versions by running "yum list | grep curl"; the versions are:
curl.x86_64 7.19.7-46.el6
libcurl.x86_64 7.19.7-46.el6
python-pycurl.x86_64 7.19.0-8.el6
libcurl.i686 7.19.7-46.el6
libcurl-devel.i686 7.19.7-46.el6
libcurl-devel.x86_64 7.19.7-46.el6
curl -V prints the following info: curl 7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 ... Protocols: ... Features: GSS-Negotiate ...
If I run the alert_spark2_livy_port.py script independently, it runs fine. What confuses me is that all three of my hosts have exactly the same curl version, but only one has the problem above.
07-12-2018
02:45 AM
The Spark Livy alert always reports: Connection failed on host ***:8999. In detail, it prints: ExecutionFailed: Execution of 'curl -s -o /dev/null -w'%{http_code}' --negotiate -u: -k http://host:8999/session | grep 200' returned 1, curl: option --negotiate: the installed libcurl version doesn't support this, curl: try curl --help ... I have 3 hosts in this cluster, but only one host reports this alert. I have checked the curl and libcurl versions on each host, and they are all the same. It may be caused by installing Anaconda and the resulting Python version change, but I am not sure, as the default Python version is 2.6. How can I fix it? Thanks!
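As a basic sanity check, Livy itself can also be probed directly over HTTP, which takes curl out of the picture (the host name is a placeholder and this skips the Kerberos --negotiate step, so it only tells me whether Livy answers at all):

```python
import requests

# Placeholder host; the alert ultimately just expects an HTTP 200 from Livy's REST API.
url = "http://host:8999/sessions"
try:
    resp = requests.get(url, timeout=5)
    print(url, "->", resp.status_code)   # 200 means the Livy server is reachable
except requests.exceptions.RequestException as err:
    print("Livy not reachable:", err)
```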
Labels:
- Apache Spark
04-03-2018
03:32 AM
I am trying to read data from Kafka and write it out in Parquet format via Spark Streaming. The problem is that the data from Kafka has a variable structure. For example, app one has columns A, B, C and app two has columns B, C, D, so the DataFrame I read from Kafka ends up with all columns A, B, C, D. When I write the DataFrame to Parquet partitioned by app name, the Parquet files for app one also contain column D, even though column D is empty and holds no data at all. So how can I filter out the empty columns when writing the DataFrame to Parquet? Thanks!
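To make the question concrete, one direction I have been considering (shown on a static DataFrame rather than the streaming one, just to illustrate the column-dropping part; paths and values are made up) is to compute, per app, which columns are entirely null and drop them before writing that app's Parquet output:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("drop-empty-columns-sketch").getOrCreate()

# Stand-in for the combined DataFrame read from Kafka: app one fills A, B, C
# and app two fills B, C, D, so each app sees one completely empty column.
df = spark.createDataFrame(
    [("app1", 1, 2, 3, None), ("app2", None, 4, 5, 6)],
    ["app", "A", "B", "C", "D"],
)

for app in [r["app"] for r in df.select("app").distinct().collect()]:
    sub = df.filter(F.col("app") == app)
    # Count non-null values per column in one pass and keep only non-empty columns.
    counts = sub.agg(*[F.count(F.col(c)).alias(c) for c in sub.columns]).first()
    keep = [c for c in sub.columns if counts[c] > 0]
    sub.select(keep).write.mode("overwrite").parquet("/tmp/out/app={}".format(app))
```

I am not sure whether this is the idiomatic way, hence the question.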
Labels:
- Apache Kafka
- Apache Spark