Member since: 06-17-2016
Posts: 56
Kudos Received: 5
Solutions: 0
07-27-2019
05:16 AM
It worked. Thanks!
05-29-2018
09:09 PM
Hi guys, thanks so much for the fast support, and thanks to the Matts team, @Matt Burgess and @Matt Clarke. I finally understood how the processor works: it emits a flow file with no payload, and the file details such as path and filename are carried in the flow file attributes. Those attributes are then used by FetchHDFS to fetch the corresponding files. Kind regards, Paul
05-28-2018
01:03 PM
Hi everyone, I already solved it after a deep analysis of the code. As you can see in the code I posted above, I am repartitioning the data. As background, the regular process transforms small files, and I want to collect the partial results and create a single file, which is then written into HDFS. That is a desired feature, since HDFS works better with bigger files. To explain it better (because "small" and "big" can be very fuzzy): our HDFS has a standard configuration of 128 MB blocks, so 2 or 3 MB files make no sense and also hurt performance. That is the regular situation, but now a backlog of around 1 TB needs to be processed, and the repartition is causing a shuffle operation. As far as I understand, repartitioning to a single partition requires collecting all the parts on one worker. Since the original RDD is bigger than the memory available on the workers, this collapses everything and throws the errors I reported above:

aswdirCsvDf.repartition(1).write

I just removed the ".repartition(1)" from the code and now everything is working. The program writes several files, that is, one file per worker, and in this context that is quite OK. Kind regards, Paul
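For reference, a minimal sketch of the change, assuming a Spark 2.x batch job; the input and output paths and the surrounding boilerplate are placeholders, and only the DataFrame name aswdirCsvDf and the removed repartition(1) call come from the post:

import org.apache.spark.sql.SparkSession

object WriteCsvExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WriteCsvExample").getOrCreate()
    // Placeholder input; stands in for the DataFrame built by the regular process.
    val aswdirCsvDf = spark.read.option("header", "true").csv("hdfs:///data/in")

    // Before: repartition(1) forces a full shuffle of the whole dataset into a
    // single partition on one executor, which exhausts memory on a ~1 TB backlog.
    // aswdirCsvDf.repartition(1).write.csv("hdfs:///data/out")

    // After: each task writes its own part file (one file per partition/worker).
    aswdirCsvDf.write.csv("hdfs:///data/out")

    // Alternative (assumption, not from the post): coalesce reduces the number
    // of output files without a full shuffle, if fewer, larger files are wanted.
    // aswdirCsvDf.coalesce(10).write.csv("hdfs:///data/out")

    spark.stop()
  }
}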
05-31-2018
09:34 AM
@Felix Albani Hi Felix, you installed 3.6.4, but according to the documentation Spark2 can only support Python up to 3.4.x. Can you kindly explain how this works?
06-20-2018
04:49 PM
@Paul Hernandez Hey Paul - did you find a solution to this? It looks like it's only Parquet that's affected; CSV doesn't have this problem. I too have data in subdirectories and Spark SQL returns null.
03-15-2018
09:02 AM
Hi @Patrick Young, you need to follow several steps to make this work.

About Python: I installed Anaconda3, and the critical step is to not let Anaconda3 be configured in the environment variables. The HDP platform needs Python 2 for some scripts, so the python path must resolve to a Python 2 installation.

Since I want to have both the spark and spark2 interpreters, I commented out the SPARK_HOME line in the zeppelin-env.sh file. Another configuration I changed in this file: according to the documentation, the variable ZEPPELIN_JAVA_OPTS changed to ZEPPELIN_INTP_JAVA_OPTS in spark2. Since both versions are active, both variables are defined:

export ZEPPELIN_JAVA_OPTS="-Dhdp.version=None -Dspark.executor.memory=512m -Dspark.executor.instances=2 -Dspark.yarn.queue=default"
export ZEPPELIN_INTP_JAVA_OPTS="-Dhdp.version=None -Dspark.executor.memory=512m -Dspark.executor.instances=2 -Dspark.yarn.queue=default"

You also need to configure the spark2 interpreter accordingly, and I created a Python interpreter as well. Finally, I created a symbolic link so that conda can be found. Create a symlink to /bin/conda:

ln -s /opt/anaconda3/bin/conda /bin/conda

Of course you have to adjust the paths above to your own paths. Hope that helps. Kind regards, Paul
03-06-2018
08:49 PM
1 Kudo
The performance has been very good, especially considering it can run on any size of node or cluster. Give it a try and upgrade to NiFi 1.5. It is certainly easier than recompiling Spark programs. If it doesn't meet your needs, go back to Spark.
02-26-2018
08:07 AM
Hi @Aditya Sirna, thanks for your answer. Independent of the Hive-Phoenix platform support, I would like to know how to centrally add Hive aux libs. I am able to access the Phoenix tables from the Hive CLI in two ways:
- running "set hive.aux.jars.path=<jar location>" in the CLI
- adding an auxlib folder to Hive
The problem with these two approaches is that the library is only available on the node or box where the CLI is running. It does not work for Hive Ambari views or other clients/boxes. Customizing the Jinja template for hive-env seems extremely complicated to me, so I will find a system engineer to do that. I don't really know if it is worth it, or whether we should just access the Phoenix tables directly without Hive. Kind regards, Paul
11-25-2017
07:19 AM
Hi @enzo EL
1) If you just need pandas with pyspark, test it with the example I provided for the spark interpreter.
2) It seems the Python interpreter is only available starting with Zeppelin 0.7.2. Is an upgrade possible for you?
3) You can add interpreters that are not installed by default by following the official documentation: https://zeppelin.apache.org/docs/0.7.0/manual/interpreterinstallation.html
I have never done it before, but it should work.
02-01-2018
11:13 AM
Hi @Krishnaswami Rajagopalan, I don't know the details of this sandbox in the Azure cloud exactly. Are you connecting to the sandbox or to the Docker container inside it? The Docker container is where Zeppelin and the other services are located. To connect to the Docker container, use port 2222 in your SSH command, for example: ssh root@127.0.0.1 -p 2222. I guess it doesn't matter whether your cluster or sandbox is running in the cloud. You should be able to find Zeppelin under /usr/hdp/current/zeppelin-server. Hope this helps. BR, Paul
11-15-2017
09:41 AM
Hi Jay, thanks for the quick answer. I installed without Ambari. The reason: I want to use Spark2, which does not work with Zeppelin 0.6. I need Zeppelin 0.7.x, which does not work with Ambari 2.4, and I cannot (or at least do not want to at this moment) upgrade our HDP 2.5.3. In this chain of dependencies, I found the best solution was to install a Zeppelin that is not managed by Ambari. When I start Zeppelin this way:

su zeppelin
/opt/zeppelin/bin/zeppelin-daemon.sh start

I get no errors, but when I navigate to http://myhost:9995 I get an HTTP 403 Forbidden. What I saw is that the directory /opt/zeppelin/webapps is empty. If I do the same as root, that directory is populated (/opt/zeppelin/webapps/webapp/...). Therefore I guess it is a permission problem, but I already set the permissions as follows:

chown -R zeppelin:zeppelin /opt/zeppelin
chmod -R 775 /opt/zeppelin
10-05-2018
09:17 AM
@slachterman I am facing some issues with PySpark code, and in some places I see there are compatibility issues, so I wanted to check if that is probably the cause. Even otherwise, it is better to check these compatibility problems upfront, I guess. So I wanted to know a few things. I am on Spark 2.3.1 and Python 3.6.5; do we know if there is a compatibility issue with these? Should I upgrade to 3.7.0 (which I am planning) or downgrade to <3.6? What, in your opinion, is more sensible?

Info on versions: Spark is spark-2.3.1-bin-hadoop2.7, all installed according to the instructions in a Python Spark course.

venkatesh@venkatesh-VirtualBox:~$ java -version
openjdk version "10.0.1" 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10-Ubuntu-3ubuntu1)
OpenJDK 64-Bit Server VM (build 10.0.1+10-Ubuntu-3ubuntu1, mixed mode)

I work on MacOS and Linux.
12-26-2017
11:09 PM
Hi @vromeo I raised this question on Stack Overflow and received an acceptable answer: https://stackoverflow.com/questions/47198678/zeppelin-python-conda-and-python-sql-interpreters-do-not-work-without-adding-a Kind regards, Paul
11-06-2017
08:20 PM
1 Kudo
Hi everyone, I already found a solution. I was sending just strings to the Angular controller (not a JSON object). In the controller, angular.forEach is used to iterate over the incoming value "newValue"; this function iterates over both objects and arrays. What I sent was interpreted as an element of a string array and not as a JSON object, i.e. {"value":"-2.5119123","loc":{"lat":53.3,"lon":7.125}}. In the first iteration the value evaluated was "value":"-2.5119123", and in the second "loc":{"lat":53.3,"lon":7.125}. I just modified the Scala code in order to send the whole Array[String]:

%spark2
import scala.util.parsing.json.JSONObject
import org.apache.spark.sql._
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
// Hive context, with recursive input directories enabled so nested folders are read.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
// Sample of wind values with their coordinates.
val COSMODE_Wind = sqlContext.sql("SELECT lat0, long0, value FROM dev_sdsp.cosmode_single_level_elements_v_10m limit 100")
case class Loc(lat: Double, lon: Double)
case class Wind(value: String, loc: Loc)
// Map each row to a Wind object, then serialize the rows to JSON strings.
val dataPoints = COSMODE_Wind.map{s => Wind(s.getDouble(2).toString, Loc(s.getDouble(0), s.getDouble(1)))}
val dataPointsJson = dataPoints.toJSON.take(100)
// Bind the Array[String] of JSON rows to the Angular front end.
z.angularBind("locations", dataPointsJson)
Then angular.forEach iterates over the whole "row", {"value":"-2.5119123","loc":{"lat":53.3,"lon":7.125}}, which can be converted to a JSON object using JavaScript. Finally, the values are accessible using dot notation. Here is the Angular snippet:

var el = angular.element($('#map').parent('.ng-scope'));
angular.element(el).ready(function() {
window.locationWatcher = el.scope().compiledScope.$watch('locations', function(newValue, oldValue) {
// geoMarkers.clearLayers(); -- if you want to only show new data clear the layer first
console.log("new value: " + newValue);
angular.forEach(newValue, function(wind) {
// Parse each JSON string once; skip the element if it is not valid JSON.
var windJSON;
try { windJSON = JSON.parse(wind); } catch(error) { alert(error); return; }
var marker = L.marker([windJSON.loc.lat, windJSON.loc.lon]).bindPopup(windJSON.value).addTo(geoMarkers);
});
})
});
Hope this helps someone.
02-20-2017
01:35 PM
Hi everyone,
I don't know exactly what I modified in Ranger, but now I am able to open the Hive View. However, I'm still getting an error:
Error while compiling statement: FAILED: SemanticException MetaException(message:java.security.AccessControlException: Permission denied: user=hive, access=READ, inode="/apps/hive/warehouse/myfile":anuser:agroup:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) ...
I set the property hive.server2.enable.doAs to both true and false, with the same result. What I cannot understand is why the user is always hive. According to this article: http://hortonworks.com/blog/best-practices-for-hive-authorization-using-apache-ranger-in-hdp-2-2/ if the property is set to true, HiveServer2 will run MR jobs in HDFS as the original user. Why is the original user also hive? Could it be related to these properties: hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hive.groups? Any comment will be appreciated.
04-01-2017
05:41 AM
@Vipin Rathor Great work!
03-06-2017
07:29 PM
Hi everyone, @jwhitmore thanks for your response. You are right: when exposing port 50010, Talend for Big Data works (with the tHDFSConnect component and co.). But even when exposing port 50010, the same error still occurs when using Talend ESB with the Camel framework, see below:

[WARN ]: org.apache.hadoop.hdfs.DFSClient - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/Ztest.csv.opened could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1641)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3198)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3122)

I've designed a Scala program and I'm facing the same issue:

15:59:22.386 [main] ERROR org.apache.hadoop.hdfs.DFSClient - Failed to close inode 500495
org.apache.hadoop.ipc.RemoteException: File /user/hdfs/testscala2.txt could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1641)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3198)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3122)

Any idea? Thanks in advance. Best regards, Mickaël.
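For context, a minimal sketch (under assumptions, not the original program) of the kind of HDFS write a Scala client would perform here, using the standard Hadoop FileSystem API. The NameNode address is a placeholder, and the dfs.client.use.datanode.hostname setting is an assumption about a common remedy when the sandbox's datanode port 50010 is exposed through Docker:

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsWriteExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Placeholder NameNode address; adjust to your cluster or sandbox.
    conf.set("fs.defaultFS", "hdfs://sandbox-hdp.hortonworks.com:8020")
    // Assumption: have the client connect to datanodes by hostname instead of
    // their internal container IPs, so the exposed 50010 port is reachable.
    conf.setBoolean("dfs.client.use.datanode.hostname", true)

    val fs = FileSystem.get(conf)
    val out = fs.create(new Path("/user/hdfs/testscala2.txt"))
    try {
      // The "could only be replicated to 0 nodes" error shows up while writing
      // or closing, when the client cannot stream the block to any reachable datanode.
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8))
    } finally {
      out.close()
    }
    fs.close()
  }
}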