Member since: 09-29-2015
Posts: 67
Kudos Received: 115
Solutions: 7
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1235 | 01-05-2016 05:03 PM |
| | 1872 | 12-31-2015 07:02 PM |
| | 1751 | 11-04-2015 03:38 PM |
| | 2156 | 10-19-2015 01:42 AM |
| | 1213 | 10-15-2015 02:22 PM |
01-09-2016
07:32 AM
3 Kudos
After reading the Spark documentation and source code, I can find two ways to reference an external configuration file inside a Spark (v1.4.1) job, but I'm unable to get either one of them to work.

Method 1: The Spark documentation says to use ./bin/spark-submit --files /tmp/test_file.txt, but it doesn't specify how to retrieve that file inside a Spark job written in Java. I see the file being uploaded, but I don't see any configuration parameter in Java that points me to the destination directory:

INFO Client: Uploading resource file:/tmp/test_file.txt -> hdfs://sandbox.hortonworks.com:8020/user/guest/.sparkStaging/application_1452310382039_0019/test_file.txt

Method 2: The Spark source code suggests using SparkContext.addFile(...) and SparkContext.textFile(SparkFiles.get(...)), but that doesn't work either, because that directory exists only locally, not in HDFS. I see this in the output of spark-submit --master yarn-client:

16/01/09 07:10:09 INFO Utils: Copying /tmp/test_file.txt to /tmp/spark-8439cc21-656a-4f52-a87d-c151b88ff0d4/userFiles-00f58472-f947-4135-985b-fdb8cf4a1474/test_file.txt
16/01/09 07:10:09 INFO SparkContext: Added file /tmp/test_file.txt at http://192.168.1.13:39397/files/test_file.txt with timestamp 1452323409690
.
.
16/01/09 07:10:17 INFO SparkContext: Created broadcast 5 from textFile at Main.java:72
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox.hortonworks.com:8020/tmp/spark-8439cc21-656a-4f52-a87d-c151b88ff0d4/userFiles-00f58472-f947-4135-985b-fdb8cf4a1474/test_file.txt
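For reference, here is a sketch (not from the original post) of one way Method 2 might be written in Java: reading the node-local copy returned by SparkFiles.get() with plain Java I/O instead of sc.textFile(), which resolves paths against HDFS. The class name and file path are illustrative.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadDistributedFile {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("read-distributed-file"));
        // Same effect as passing --files /tmp/test_file.txt to spark-submit
        sc.addFile("/tmp/test_file.txt");
        // SparkFiles.get() resolves to the local copy on whichever node runs this code,
        // so read it with java.nio rather than sc.textFile(), which resolves against HDFS.
        String localPath = SparkFiles.get("test_file.txt");
        List<String> lines = Files.readAllLines(Paths.get(localPath), StandardCharsets.UTF_8);
        System.out.println("Read " + lines.size() + " lines from " + localPath);
        sc.stop();
    }
}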
01-07-2016
11:00 PM
@Simon Elliston Ball I think you are suggesting option #4; would you provide more detail on how that might work? If I understand the suggestion correctly, this would be the pseudo-code:

JavaPairRDD<String, Float> allSamples = null;
for (String fileName : fileNames) {
    JavaRDD<String> file = sc.textFile(fileName);
    JavaRDD<String> sample = file.sample(true, 1_000_000.0 / file.count());
    JavaPairRDD<String, Float> fileToSample = sample.mapToPair(x -> {
        Float importantElement = 0f; /* extract from line */
        return new Tuple2<>(fileName, importantElement);
    });
    allSamples = (allSamples == null) ? fileToSample : allSamples.union(fileToSample);
}
At the end of this the allSamples RDD will be a 2D matrix with the rows representing each file (~100 rows) and the columns representing each iteration (~1M columns).
How do I perform aggregation to sum all elements in each column?
01-06-2016
12:45 AM
1 Kudo
I'm writing an iterative Spark application that needs to read a random line from hundreds of files and then aggregate the data in each iteration. The number of files is small (~100) and each one is small in size (<1MB), but both will grow in the future. Each file has the exact same CSV schema, and all of them live in the same directory in HDFS. In pseudo-code, the application would look like this:

for each trial in 1..1,000,000:
val total = 0
for file in all files:
val = open file and read random line
total += val
done
return total
done

I see the following possibilities to accomplish this in Spark:

1. Execute ~1M iterations, and in each one open ~100 files, read one line from each, and perform the aggregation (the approach in the pseudo-code above). This is simple, but very I/O intensive, because there will be 1M * 100 calls to open a file in HDFS.
2. Place the contents of each of the ~100 files into memory in the driver program, then broadcast that to each of the ~1M iterations. Each iteration would read a random line from the ~100 in-memory objects and aggregate the results. This is better, but each object has to be serialized and transferred over the network from the driver program. (A rough sketch of this option follows below.)
3. Create an external Hive table, and in each iteration execute select queries to fetch a random row, then aggregate the results.
4. Execute ~100 iterations, and in each one open a single file and read ~1M lines from it at random. Each iteration would return a list of values ~1M long, and all of the aggregation would be performed in the driver program.

What is the best approach?
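Here is a minimal sketch of how option #2 might look in Java. The file paths, CSV column position, partition count, and class name are illustrative assumptions, not taken from the actual application.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastTrialsSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("broadcast-trials"));

        // ~100 small files; these paths are placeholders.
        List<String> fileNames = Arrays.asList("/data/file1.csv", "/data/file2.csv");

        // Pull each file into the driver (each is <1MB) and broadcast the whole set once.
        List<List<String>> contents = new ArrayList<>();
        for (String fileName : fileNames) {
            contents.add(sc.textFile(fileName).collect());
        }
        final Broadcast<List<List<String>>> bcFiles = sc.broadcast(contents);

        // One element per trial; the trial number doubles as a random seed.
        List<Integer> trials = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            trials.add(i);
        }

        JavaRDD<Double> totals = sc.parallelize(trials, 100).map(trial -> {
            Random rnd = new Random(trial);
            double total = 0;
            for (List<String> lines : bcFiles.value()) {
                String line = lines.get(rnd.nextInt(lines.size()));
                total += Double.parseDouble(line.split(",")[0]); // assumes the value is the first CSV column
            }
            return total;
        });

        double grandTotal = totals.reduce((a, b) -> a + b);
        System.out.println("Grand total over all trials: " + grandTotal);
        sc.stop();
    }
}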
Labels:
- Apache Hadoop
- Apache Spark
01-05-2016
05:03 PM
2 Kudos
You can do something like this:

hdfs dfs -put *filename*[0-9].txt /tmp

For example:

$ touch ${RANDOM}filename-$(date +"%F").txt ${RANDOM}filename.txt
$ ls *filename*.txt
17558filename-2016-01-05.txt 27880filename.txt
$ hdfs dfs -put *filename*[0-9].txt /tmp
$ hdfs dfs -ls /tmp
-rw-r--r-- 3 hdfs hdfs 0 2016-01-05 16:39 /tmp/17558filename-2016-01-05.txt
If that doesn't work, run this first to make sure shell globbing is enabled:

set +f
01-05-2016
12:15 AM
That worked! I was missing the extra HDP Jetty repo: <url>http://repo.hortonworks.com/content/repositories/jetty-hadoop/</url>
01-04-2016
11:10 PM
Because that doesn't work: I'm not trying to build the hadoop-auth package. Setting '-Djetty.version=6.1.26' does nothing, since that property isn't used anywhere in my Spark application's build.
01-04-2016
08:10 PM
I'm trying to build a simple Spark Java application that pulls its dependencies from the HDP Releases repository. My project only depends on:

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.5.2.2.3.4.0-3485</version>
<scope>provided</scope>
</dependency>

Through a complex web of transitive dependencies, jetty-util version 6.1.26.hwx is required, but it is not found in any publicly visible repository. Where can I find this dependency so that I can build a Spark application that uses Hortonworks-packaged jars? Is it best to exclude the jetty-util dependency from spark-core_2.10 and then explicitly add it to the project with a non-HWX version?

(This question is similar to https://community.hortonworks.com/questions/6256/j... but the solution posted there does not work for my scenario.)
Labels:
- Apache Spark
01-03-2016
07:06 PM
2 Kudos
Is there a Maven archetype that does the following?

1. Sets up the HDP Maven repository
2. Creates Maven profiles to download different versions of the Spark and Hadoop jars based on HDP version
3. Sets up the build of an uber jar
4. Adds a sample Java application and test case to the project
5. Adds winutils binaries so that test cases can run on Windows

I found this: https://github.com/spark-in-action/scala-archetype-sparkinaction, which does #4, but for Scala.
Labels:
- Apache Spark
12-31-2015
07:02 PM
This Oozie log doesn't contain the details of why the shell action failed. To get those details, look at the YARN job logs for this action: click on "YARN" in Ambari, then on "Resource Manager UI" under "Quick Links", and then look at the logs for the application whose name contains the Oozie workflow ID. My guess is that there was a permission problem when writing the e-mail content.
12-30-2015
10:40 PM
13 Kudos
Visualize near-real-time stock price changes using Solr and Banana UI
The goal of this tutorial is to create a moving chart that shows the changes in price of a few stock symbols, similar to Google Finance or Yahoo Finance.
Summary of steps
Download and install the HDP Sandbox
Download and install the latest NiFi release
Create a Solr dashboard to visualize the results
Create a new NiFi flow to pull from Google Finance API, transform, and store in HBase and Solr
Step-by-step
1. Download and install the HDP Sandbox
Download the latest (2.3 as of this writing) HDP Sandbox here. Import it into VMware or VirtualBox, start the instance, and update the DNS entry on your host machine to point to the new instance’s IP.
On Mac, edit /etc/hosts; on Windows, edit %systemroot%\system32\drivers\etc\ as administrator. Add a line similar to the one below:
192.168.56.102 sandbox sandbox.hortonworks.com
2. Download and install the latest NiFi release
Follow the directions here. These are the steps I executed for 0.4.1:
cd /tmp
wget http://apache.cs.utah.edu/nifi/0.4.1/nifi-0.4.1-bin.zip
cd /opt/
unzip /tmp/nifi-0.4.1-bin.zip
useradd nifi
chown -R nifi:nifi /opt/nifi-0.4.1/
perl -pe 's/run.as=.*/run.as=nifi/' -i /opt/nifi-0.4.1/conf/bootstrap.conf
perl -pe 's/nifi.web.http.port=8080/nifi.web.http.port=9090/' -i /opt/nifi-0.4.1/conf/nifi.properties
/opt/nifi-0.4.1/bin/nifi.sh start
3. Create a Solr dashboard to visualize the results
Download a new Solr dashboard, start the service, and create a new collection to store stock price changes:
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64
wget https://raw.githubusercontent.com/vzlatkin/Stocks2HBaseAndSolr/master/Solr%20Dashboard.json -O /opt/hostname-hdpsearch/solr/server/solr-webapp/webapp/banana/app/dashboards/default.json
/opt/hostname-hdpsearch/solr/bin/solr start -c -z localhost:2181
/opt/hostname-hdpsearch/solr/bin/solr create -c stocks -d data_driven_schema_configs -s 1 -rf 1
4. Create a new NiFi flow to pull from Google Finance API, transform, and store in HBase and Solr
Solr is used for indexing the data, the Banana UI is used for visualization, and HBase is used for future-proofing. HBase can be used to further analyze the data from Storm/Spark or to create a custom UI (a short Java example of reading the table back appears after the conclusion). To get the data into these tools, follow the steps below:
Start HBase via Ambari
Create a new table:
hbase shell
hbase(main):001:0> create 'stocks', 'cf'
Then download this NiFi template to your host machine.
To import the template, open the NiFi UI
Open Templates manager:
Find the template on your local machine and import it:
Drag and drop to instantiate a new template:
Double click the new process group:
You'll need to enable the HBase shared controller. To do so, right-click the "Send to HBase" processor, click "Configure", then "Properties", and then the "Go to" arrow to access the controller. Finally, click the "Enable" button.
Now start all of the processors: hold down the Shift key, select all of the processors on the screen, and then click the start button:
You should see a flow that looks like the below screenshot
The reason for so many processors is that the response from the Google Finance API needs to be transformed. First, we remove the comment characters '//' from the response. Second, we split the array into individual JSON objects. Third, we extract the relevant attributes. Fourth, the timestamp is formatted as UTC but is actually in the EST timezone, so we fix that. Finally, we send the information to HBase, Solr, and the NiFi bulletin board for logging.
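Purely as an illustration of those transformation steps outside NiFi, the same cleanup could be sketched in Java as below. The sample payload and the JSON field names ("t", "l", "lt_dts") are assumptions about the Google Finance quote format, not taken from the flow, and Jackson is assumed to be on the classpath.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class QuoteCleanupSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical response from the quote API (field names are assumptions).
        String raw = "// [ { \"t\": \"AAPL\", \"l\": \"100.25\", \"lt_dts\": \"2016-01-05T16:39:00Z\" } ]";

        // 1. Remove the leading comment characters.
        String json = raw.replaceFirst("^\\s*//", "").trim();

        // 2. Split the array into individual JSON objects.
        JsonNode quotes = new ObjectMapper().readTree(json);
        for (JsonNode quote : quotes) {
            // 3. Extract the relevant attributes.
            String symbol = quote.get("t").asText();
            double lastPrice = quote.get("l").asDouble();

            // 4. The timestamp claims to be UTC ("Z") but is actually US/Eastern, so re-tag it.
            LocalDateTime local = LocalDateTime.parse(quote.get("lt_dts").asText().replace("Z", ""));
            ZonedDateTime fixed = local.atZone(ZoneId.of("America/New_York"));

            // 5. In the real flow, this record is sent to HBase, Solr, and the bulletin board.
            System.out.println(symbol + " " + lastPrice + " @ " + fixed.toInstant());
        }
    }
}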
Conclusion
Now open the Banana UI. If you are doing this when the US stock markets are open (9:30am to 4pm Eastern Time), then you should see a dashboard similar to the below.
Full source code is available in GitHub.
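As a follow-up to the earlier note about analyzing the data from HBase, a minimal Java scan of the 'stocks' table might look like the sketch below. It assumes the HBase 1.x client is on the classpath and that ZooKeeper runs on the sandbox host; it simply prints whatever qualifiers the flow wrote under the 'cf' column family.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanStocksSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "sandbox.hortonworks.com"); // assumption: sandbox ZK host

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("stocks"));
             ResultScanner scanner = table.getScanner(new Scan())) {
            for (Result row : scanner) {
                StringBuilder sb = new StringBuilder(Bytes.toString(row.getRow())).append(":");
                for (Cell cell : row.rawCells()) {
                    sb.append(" ").append(Bytes.toString(CellUtil.cloneQualifier(cell)))
                      .append("=").append(Bytes.toString(CellUtil.cloneValue(cell)));
                }
                System.out.println(sb);
            }
        }
    }
}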