Member since
09-29-2015
67
Posts
115
Kudos Received
7
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
1268 | 01-05-2016 05:03 PM | |
1928 | 12-31-2015 07:02 PM | |
1790 | 11-04-2015 03:38 PM | |
2220 | 10-19-2015 01:42 AM | |
1304 | 10-15-2015 02:22 PM |
01-09-2016
07:32 AM
3 Kudos
After reading the Spark documentation and source code, I can find two ways to reference an external configuration file inside of a Spark (v1.4.1) job, but I'm unable to get either one of them to work. Method 1: from Spark documentation says to use ./bin/spark-submit --files /tmp/test_file.txt, but doesn't specify how to retrieve that file inside of a Spark job written in Java. I see it being added, but I don't see any configuration parameter in Java that will point me to the destination directory INFO Client: Uploading resource file:/tmp/test_file.txt -> hdfs://sandbox.hortonworks.com:8020/user/guest/.sparkStaging/application_1452310382039_0019/test_file.txt Method 2: from Spark source code suggests to use SparkContext.addFile(...) and SparkContext.textFile(SparkFiles.get(...)), but that doesn't work either as that directory doesn't exist in HDFS (only locally). I see this in the output of spark-submit --master yarn-client 16/01/09 07:10:09 INFO Utils: Copying /tmp/test_file.txt to /tmp/spark-8439cc21-656a-4f52-a87d-c151b88ff0d4/userFiles-00f58472-f947-4135-985b-fdb8cf4a1474/test_file.txt
16/01/09 07:10:09 INFO SparkContext: Added file /tmp/test_file.txt at http://192.168.1.13:39397/files/test_file.txt with timestamp 1452323409690
.
.
16/01/09 07:10:17 INFO SparkContext: Created broadcast 5 from textFile at Main.java:72
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox.hortonworks.com:8020/tmp/spark-8439cc21-656a-4f52-a87d-c151b88ff0d4/userFiles-00f58472-f947-4135-985b-fdb8cf4a1474/test_file.txt
... View more
Labels:
01-07-2016
11:00 PM
@Simon Elliston Ball I think you are suggesting option #4; would you provide more detail of how that might work? If I understand the suggestion correctly, this would be the pseudo-code: JavaPairRDD<String, Float> allSamples;
for (fileName in fileNames) {
JavaRDD<String> file = sc.textFile(fileName);
JavaRDD<String> sample = file.sample(true, 1,000,000 / file.count());
JavaPairRDD<String, Float> fileToSample = sample.mapToPair(x -> {
Float importantElement = /* extract from line */
return new Tuple2<>(fileName , importantElement);
});
allSamples.union(fileToSample);
}
At the end of this the allSamples RDD will be a 2D matrix with the rows representing each file (~100 rows) and the columns representing each iteration (~1M columns).
How do I perform aggregation to sum all elements in each column?
... View more
01-06-2016
12:45 AM
1 Kudo
I'm writing an iterative Spark application that needs to read a random line from hundreds of files and then aggregate the data in each iteration. The number of files is small: ~100, and each one is small in size: <1MB, but both will grow in the future. Each file has the exact same CSV schema and all of them live in the same directory in HDFS. In pseudo-code, the application would look like this: for each trial in 1..1,000,000:
val total = 0
for file in all files:
val = open file and read random line
total += val
done
return total
done I see the following possibilities to accomplish this in Spark: Execute ~1M iterations and in each one open ~100 files, read one line, and perform aggregation (the approach in pseudo-code above). This is simple, but very I/O intensive because there will be 1M * 100 calls to open a file in HDFS. Place the contents of each of the ~100 files into memory in the driver program and then broadcast that to each of the ~1M iterations. Each iteration would read a random line from ~100 in-memory objects, then aggregate the results. This is better, but each object has to be serialized and transferred over the network from the driver program. Create an external Hive table and in each iteration execute select queries to fetch a random row, then aggregate the results. Execute ~100 iterations and in each one open a single file and read ~1M lines from it at random. Each iteration would return an list of values ~1M long and all of the aggregation would be performed in the driver program. What is the best approach?
... View more
Labels:
- Labels:
-
Apache Hadoop
-
Apache Spark
01-05-2016
05:03 PM
2 Kudos
You can do something like this: hdfs dfs -put *filename*[0-9].txt /tmp
For example: $ touch ${RANDOM}filename-$(date +"%F").txt ${RANDOM}filename.txt
$ ls *filename*.txt
17558filename-2016-01-05.txt 27880filename.txt
$ hdfs dfs -put *filename*[0-9].txt /tmp
$ hdfs dfs -ls /tmp
-rw-r--r-- 3 hdfs hdfs 0 2016-01-05 16:39 /tmp/17558filename-2016-01-05.txt
If that doesn't work, add this to the beginning: set +f
... View more
01-05-2016
12:15 AM
That worked! I was missing the extra HDP Jetty repo: <url>http://repo.hortonworks.com/content/repositories/jetty-hadoop/</url>
... View more
01-04-2016
11:10 PM
Because that doesn't work as I'm not trying to build the hadoop-auth package. Setting '-Djetty.version=6.1.26' does nothing since that variable isn't used anywhere in my Spark application.
... View more
01-04-2016
08:10 PM
I'm trying to build a simple Spark Java application that pulls its dependencies from the HDP Releases repository. My project only depends on: <dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.5.2.2.3.4.0-3485</version>
<scope>provided</scope>
</dependency> Through a complex web of dependencies jetty-util version 6.1.26.hwx is required, but it is not found in any publicly visible repository. Where can I find this dependency so that I can build a Spark application that uses Hortonworks packaged jars? Is it best to exclude the jetty-util dependency from spark-core_2.10 and then explicitly add it to the project with a non-HWX version? (This question is similar to https://community.hortonworks.com/questions/6256/j... but the solution posted there does not work for my scenario.)
... View more
Labels:
- Labels:
-
Apache Spark
01-03-2016
07:06 PM
2 Kudos
Is there a Maven Archetype that does the following? Setup HDP Maven repository Create Maven profiles to download different versions of Spark and Hadoop jars based on HDP version Setup for building an uber jar Add a sample Java application and test case to the project Add winutils binaries to run test-cases on Windows I found this: https://github.com/spark-in-action/scala-archetype-sparkinaction, which does #4, but for Scala.
... View more
Labels:
- Labels:
-
Apache Spark
12-31-2015
07:02 PM
This Oozie log doesn't contain the details of why shell action failed. To get those details, look at the YARN job logs for this action. Click on "YARN" in Ambari, then on "Resource Manager UI" under "Quick Links" and then look at the logs for the application which contains the Oozie workflow ID in it. My guess is that there was a permission problem when writing e-mail content.
... View more
12-30-2015
10:40 PM
13 Kudos
Visualize near-real-time stock price changes using Solr and Banana UI
The goal of this tutorial is to create a moving chart that shows the changes in price of a few stock symbols, similar to Google Finance or Yahoo Finance.
Summary of steps
Download and install the HDP Sandbox
Download and install the latest NiFi release
Create a Solr dashboard to visualize the results
Create a new NiFi flow to pull from Google Finance API, transform, and store in HBase and Solr
Step-by-step
1. Download and install the HDP Sandbox
Download the latest (2.3 as of this writing) HDP Sandbox here. Import it into VMware or VirtualBox, start the instance, and update the DNS entry on your host machine to point to the new instance’s IP.
On Mac, edit /etc/hosts, on Windows, edit %systemroot%\system32\drivers\etc\ as administrator and add a line similar to the below:
192.168.56.102 sandbox sandbox.hortonworks.com
2. Download and install the latest NiFi release
Follow the directions here. These were the steps that I executed for 0.4.1
cd /tmp
wget http://apache.cs.utah.edu/nifi/0.4.1/nifi-0.4.1-bin.zip
cd /opt/
unzip /tmp/nifi-0.4.1-bin.zip
useradd nifi
chown -R nifi:nifi /opt/nifi-0.4.1/
perl -pe 's/run.as=.*/run.as=nifi/' -i /opt/nifi-0.4.1/conf/bootstrap.conf
perl -pe 's/nifi.web.http.port=8080/nifi.web.http.port=9090/' -i /opt/nifi-0.4.1/conf/nifi.properties
/opt/nifi-0.4.1/bin/nifi.sh start
3. Create a Solr dashboard to visualize the results
Download a new Solr dashboard, start the service, and create a new collection to store stock price changes:
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64
wget https://raw.githubusercontent.com/vzlatkin/Stocks2HBaseAndSolr/master/Solr%20Dashboard.json -O /opt/hostname-hdpsearch/solr/server/solr-webapp/webapp/banana/app/dashboards/default.json
/opt/hostname-hdpsearch/solr/bin/solr start -c -z localhost:2181
/opt/hostname-hdpsearch/solr/bin/solr create -c stocks -d data_driven_schema_configs -s 1 -rf 1
4. Create a new NiFi flow to pull from Google Finance API, transform, and store in HBase and Solr
Solr is used for indexing the data, Banana UI is used for visualization, and HBase is used for future-proofing. HBase can be used to further analyze the data from Storm/Spark or to create a custom UI. The get the data into these tools, follow the steps below:
Start HBase via Ambari
Create a new table:
hbase shell
hbase(main):001:0> create 'stocks', 'cf'
Then download this NiFi template to your host machine.
To import the template, open the NiFi UI
Open Templates manager:
Find the template on your local machine and import it:
Drag and drop to instantiate a new template:
Double click the new process group:
You'll need to enable the HBase shared controller. To do so, click the right mouse button over the "Send to HBase" process, then click "Configure", then "Properties" and the "Go to" arrow to access the controller. Finally, click the "Enable" button.
Now start all of the processes. Hold down the Shift-key, and select all of the processes on the screen. Then click the start button:
You should see a flow that looks like the below screenshot
The reason for so many processes is that the response from Google Finance API needs to be transformed. First, we remove the comment characters '//' from the response. Second, we split the array into individual JSON objects. Third, we extract the relevant attributes. Fourth, the timestamp has the format of UTC, but it is actually in EST timezone, therefore, we fix that. Finally, we send the information to HBase, Solr, and the NiFi bulletin board for logging.
Conclusion
Now open the Banana UI. If you are doing this when the US stock markets are open (9:30am to 4pm Eastern Time), then you should see a dashboard similar to the below.
Full source code is available in GitHub.
... View more
Labels:
- « Previous
- Next »