Member since: 11-07-2016
Posts: 637
Kudos Received: 253
Solutions: 144
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2286 | 12-06-2018 12:25 PM |
| | 2341 | 11-27-2018 06:00 PM |
| | 1812 | 11-22-2018 03:42 PM |
| | 2881 | 11-20-2018 02:00 PM |
| | 5228 | 11-19-2018 03:24 PM |
06-20-2018
07:21 AM
@Saurabh Ambre, Glad to know that the previous issue is resolved. It is always good to create a separate thread for each issue, so please open a new question for this one so the main thread doesn't get sidetracked. In the new question, include the complete stack trace and attach the pom.xml file. Feel free to tag me in it. Please accept the answer above.
06-19-2018
02:46 PM
1 Kudo
@Saurabh Ambre, Try adding the two configuration lines below and see if it works.

import org.apache.hadoop.conf.Configuration;

// tell Hadoop which FileSystem implementation to use for each scheme
Configuration conf = new Configuration();
conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

Please "Accept" the answer if this works. This will be helpful for other community users.

-Aditya
06-19-2018
04:52 AM
@Anpan K, HDFS audit is different from Ranger HDFS audit. Ranger HDFS audit works only when the Ranger HDFS plugin is enabled; if the plugin is not enabled, you use the plain HDFS audits instead. Ranger audits are stored in HDFS and can also be stored in Solr. To send audits to Solr, enable the "Audits to Solr" setting from Ambari. The Ranger UI can show the audit logs only when they are written to Solr; it cannot read audits from HDFS.

-Aditya
06-19-2018
04:17 AM
2 Kudos
@Sandeep Ahuja, textFile() partitions the data based on the number of HDFS blocks the file uses. If the file occupies only 1 block, the RDD is initialized with a minimum of 2 partitions. If you want to increase the minimum number of partitions, pass it as an argument like below:

files = sc.textFile("hdfs://user/cloudera/csvfiles", minPartitions=10)

To check the number of partitions, run:

files.getNumPartitions()

Note: If you set minPartitions to less than the number of HDFS blocks, Spark simply uses the number of HDFS blocks instead and does not raise any error.

Please "Accept" the answer if this helps, or reply back with any questions.

-Aditya
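For reference, a quick way to see the note above in action. This is only a sketch, assuming a live SparkContext named sc and that the example path from above spans several HDFS blocks:

# hypothetical: the example path above, assumed here to span about 4 HDFS blocks
files = sc.textFile("hdfs://user/cloudera/csvfiles", minPartitions=2)
# prints the HDFS block count (e.g. 4), not 2, since minPartitions is below it
print(files.getNumPartitions())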
06-15-2018
02:46 PM
@Sayantan Dash, This is just a warning message and shouldn't be the problem. Can you check whether there are any other error logs?
06-15-2018
11:58 AM
1 Kudo
@JAy PaTel, You cannot redirect the output of echo directly into an HDFS file. Instead, write it to a local file first and then append that file to HDFS, like below:

echo "`date` hi" > /tmp/output ; hdfs dfs -appendToFile /tmp/output /tmp/abc.txt

-Aditya
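If you need to drive the same two steps from a Python script instead of the shell, a rough equivalent is sketched below; it assumes Python 3, that the hdfs CLI is on the PATH, and reuses the example paths from above:

import subprocess
from datetime import datetime

# write the timestamped line to a local temp file first (same idea as the echo above)
with open("/tmp/output", "w") as f:
    f.write("{} hi\n".format(datetime.now()))

# then append the local file to the HDFS file
subprocess.run(["hdfs", "dfs", "-appendToFile", "/tmp/output", "/tmp/abc.txt"], check=True)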
06-15-2018
10:56 AM
1 Kudo
@Alex Witte, According to your question, you want to transform it to the format below:

Col1 Col2
1 [agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue]
2 [agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount]

I have changed your code a little and was able to achieve it. Please check this code and the pyspark execution output:

from pyspark.sql.types import *
data_schema = [StructField('id', IntegerType(), False),StructField('route', StringType(),False)]
final_struc = StructType(fields=data_schema)
df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt',schema=final_struc)
df.show()
from pyspark.sql.functions import udf
def str_to_arr(my_list):
    # split the comma separated string and wrap it in brackets
    my_list = my_list.split(",")
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'
str_to_arr_udf = udf(str_to_arr,StringType())
df = df.withColumn('route_arr',str_to_arr_udf(df["route"]))
df = df.drop("route")
df.show()

>>> from pyspark.sql.types import *
>>> data_schema = [StructField('id', IntegerType(), False),StructField('route', StringType(),False)]
>>> final_struc = StructType(fields=data_schema)
>>> df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt',schema=final_struc)
>>> df.show()
+---+--------------------+
| id| route|
+---+--------------------+
| 1|agakhanpark,scien...|
| 2|agakhanpark,wynfo...|
+---+--------------------+
>>>
>>>
>>> from pyspark.sql.functions import udf
>>> def str_to_arr(my_list):
... my_list = my_list.split(",")
... return '[' + ','.join([str(elem) for elem in my_list]) + ']'
...
>>> str_to_arr_udf = udf(str_to_arr,StringType())
>>> df = df.withColumn('route_arr',str_to_arr_udf(df["route"]))
>>> df = df.drop("route")
>>> df.show()
+---+--------------------+
| id| route_arr|
+---+--------------------+
| 1|[agakhanpark,scie...|
| 2|[agakhanpark,wynf...|
+---+--------------------+

Please "Accept" the answer if this helps.

-Aditya
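As a side note, the same bracketed string can also be built without a Python UDF by using Spark's built-in column functions. This is only a sketch, assuming the same df and column names as above, applied right after the read (while the route column still exists):

from pyspark.sql.functions import concat, lit

# wrap the existing comma separated string in brackets using built-in functions
df = df.withColumn('route_arr', concat(lit('['), df['route'], lit(']'))).drop('route')
df.show()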
06-13-2018
04:01 PM
@Basil Paul, It looks like your NameNode is not running. Start the NameNode first and then try starting the HBase master. Also, it is better to use proper hostnames instead of hbase1, hbase2, etc.

Please "Accept" the answer if this helps you. It will be really useful for other community users.

-Aditya
06-11-2018
04:00 PM
1 Kudo
@Sami Ahmad, When you run "select count(*) from emp", the ResultSet (rst) contains only a single column (the count), so rst.getString(2) fails with an IndexOutOfBoundsException because there is no second column. Remove the rst.getString(2) call when you run select count(*) from emp and it will work properly.

Please "Accept" the answer if this works for you.

-Aditya
05-29-2018
01:23 PM
1 Kudo
@bigdata.neophyte, Yes, it is possible to install the cluster without services like HDFS, YARN, MR, etc. However, Ambari recommends installing SmartSense and Ambari Metrics, which you can delete after installation, or you can use Blueprints to install the cluster.

-Aditya