Member since: 02-07-2017
Posts: 23
Kudos Received: 2
Solutions: 1
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 16371 | 01-23-2018 03:02 PM |
02-15-2018
03:57 PM
Why don't you use the MergeContent processor to concatenate the flowfile content?
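If it helps, here is a minimal property sketch for MergeContent (just an illustration, assuming the default Bin-Packing Algorithm strategy and plain binary concatenation; the bin sizes are placeholders to tune for your flow):
Merge Strategy = Bin-Packing Algorithm
Merge Format = Binary Concatenation
Minimum Number of Entries = 1000   (flowfiles collected into a bin before merging)
Maximum Number of Entries = 10000
Max Bin Age = 5 min   (flush a bin even if it never fills up)
Delimiter Strategy = Text
Demarcator = \n   (newline inserted between the concatenated contents)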
02-04-2018
05:18 AM
@Matt Krueger Your table is ACID, i.e. transaction-enabled. Spark doesn't support reading Hive ACID tables. Take a look at SPARK-15348 and SPARK-16996.
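As a possible workaround (just a sketch; the table names here are hypothetical), you can materialize a plain, non-transactional copy of the ACID table from Hive and point Spark at the copy:
// Run beforehand in Hive/beeline (not from Spark), e.g.:
//   CREATE TABLE mytable_copy STORED AS ORC AS SELECT * FROM mytable;
// Then in Spark (SparkSession built with .enableHiveSupport()):
Dataset<Row> df = spark.table("mytable_copy");
df.show();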
02-04-2018
04:11 AM
Could you share a bit more about your issue, such as the scenario or the process you are following? You would get more responses that way. 🙂
01-28-2018
03:41 AM
Take a look at this guide: https://cwiki.apache.org/confluence/display/hive/languagemanual+dml#LanguageManualDML-Loadingfilesintotables
You should try either
INSERT INTO TABLE ${hiveconf:inputtable} SELECT * FROM datafactory7 LIMIT 14;
or
LOAD DATA INPATH '<HDFS PATH WHERE FILES LOCATED>' INTO TABLE ${hiveconf:inputtable};
01-27-2018
10:34 AM
I can see the following in the error log:
ERROR: org.apache.hadoop.security.authorize.AuthorizationException: User: livy is not allowed to impersonate admin
It looks like your Hadoop cluster is a secure one. You need to grant livy the ability to impersonate the originating user by adding two properties to core-site.xml. Take a look at this guide.
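For reference, the two properties are the standard Hadoop proxy-user settings for the livy user; a minimal core-site.xml sketch (the wildcard values are permissive placeholders, tighten them for your cluster):
<property>
  <name>hadoop.proxyuser.livy.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.livy.groups</name>
  <value>*</value>
</property>
Restart the affected Hadoop services after changing core-site.xml so the new proxy-user settings take effect.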
01-27-2018
10:28 AM
How is it different from using Livy to do the same?
01-27-2018
10:11 AM
A small correction: this was introduced in Ranger 0.7, and the policies should look like this:

// HDFS
resource: path=/home/{USER}
user: {USER}

// Hive
resource: database=db_{USER}; table=*; column=*
user: {USER}

where {USER} is substituted with the user id of the currently logged-in user.
01-24-2018
07:08 AM
Spark by default looks for files in HDFS, but if for some reason you want to load a file from the local filesystem, you need to prepend "file://" to the file path. So your code will be:

Dataset<Row> jsonTest = spark.read().json("file:///tmp/testJSON.json");

However, this will be a problem when you submit in cluster mode, since in cluster mode the application executes on the worker nodes: all the worker nodes are expected to have that file at that exact path, so it will fail. To overcome this, you can pass the file path in the --files parameter of spark-submit, which ships the file to the working directory of each executor, so you can refer to it by the file name alone. For example, if you submitted the following way:

> spark-submit --master <your_master> --files /tmp/testJSON.json --deploy-mode cluster --class <main_class> <application_jar>

then you can simply read the file the following way:

Dataset<Row> jsonTest = spark.read().json("testJSON.json");
01-23-2018
03:02 PM
1 Kudo
Cloudera has a nice two-part tuning guide. Attaching the links:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
01-23-2018
02:13 PM
Apart from the JVM limitation, which you can increase, there is no definite limit on the size or number of flowfile records as such. I would say design your flow, and if you feel that you're throttled, you can follow some good design practices to tweak it. Take a look at this: https://community.hortonworks.com/articles/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.html