Member since: 02-04-2016
Posts: 189
Kudos Received: 70
Solutions: 9
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3740 | 07-12-2018 01:58 PM |
| | 7825 | 03-08-2018 10:44 AM |
| | 3745 | 06-24-2017 11:18 AM |
| | 23269 | 02-10-2017 04:54 PM |
| | 2286 | 01-19-2017 01:41 PM |
05-23-2018
07:38 PM
Thanks @Matt Clarke. My mistake on the response - I clicked "reply" but apparently managed to type in the wrong box... I have one follow-up question, since you seem to know NiFi: simply opening access to port 8080 on the NiFi server doesn't appear to be sufficient for making it accessible to other computers on the same network. I've been looking for instructions, and everything I've found points to setting up HTTPS, certificates, keys, etc. (like this: https://bryanbende.com/development/2016/08/17/apache-nifi-1-0-0-authorization-and-multi-tenancy). Is that the only option? For reference, this is running in a VPC and only machines with VPN access can see the server at all. Thanks!
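For what it's worth, the address and port the NiFi UI listens on are controlled in conf/nifi.properties. A minimal plain-HTTP sketch - the property names are the standard NiFi ones, but the values here are assumptions for this VPC setup, not a confirmed fix:

# conf/nifi.properties - unsecured HTTP access inside the VPC
# Bind to all interfaces instead of only localhost (value assumed for this setup;
# leaving the host blank has a similar effect in many NiFi versions)
nifi.web.http.host=0.0.0.0
nifi.web.http.port=8080
# HTTPS properties stay blank when the instance is not secured
nifi.web.https.host=
nifi.web.https.port=

Note that without HTTPS there is no user authentication at all, so anyone who can reach the port gets full access to the flow - which may be acceptable behind the VPN, but that trade-off is what the certificate/authorization guides are addressing.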
05-23-2018
04:48 PM
Thanks @Matt Clarke. So I'm thinking I'll open up the port so that different devs can access the flow through the browser (it's all protected by VPN) and use process groups to help isolate distinct pieces. Does that sound like a good plan?
05-23-2018
04:06 PM
I would like to introduce NiFi as a tool for controlling a top-level workflow, but I want it to be something that my whole team can access and maintain, and I'm wondering about best practices in this context.

For example, we currently have a single NiFi instance with a single flow on a shared server, so anyone on the team can RDP to the server and see/edit the flow at localhost:8080 - but only one person at a time. What if we want multiple flows and the ability for multiple devs to have access at the same time?

At a high level, it looks like we could run multiple instances of NiFi and just keep a record somewhere that localhost:8080 is prod and localhost:8090 is dev, or something like that. But that still doesn't allow admin A to work on prod and admin B to work on dev at the same time; they would have to make changes on separate machines and then deploy the XML.

Even if we opened up the ports so that NiFi is accessible through the browser on a remote machine, how does it work if two devs are editing at the same time? Is that OK as long as they are in separate process groups? I'm trying to understand the options and best practices for this scenario. Thanks!
Labels:
- Apache NiFi
04-10-2018
08:31 PM
Here's what I ended up with:

import org.apache.spark.sql.functions.{callUDF, input_file_name}

spark.udf.register("getOnlyFileName", (fullPath: String) => fullPath.split("/").last)
val df2 = df1.withColumn("source_file_name2", callUDF("getOnlyFileName", input_file_name()))
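A variant of the same idea that skips the UDF registration and uses only Spark's built-in column functions - the df3 and source_file_name3 names are just illustrative, not from the original thread:

import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

// regexp_extract keeps everything after the last "/" in the full input path
val df3 = df1.withColumn("source_file_name3",
  regexp_extract(input_file_name(), "[^/]+$", 0))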
04-10-2018
05:19 PM
Thanks @Amol Thacker. One quick follow-up: do you know what the syntax would be to strip the path from the file name? So, convert /my/path/to/my/file.txt to file.txt. I'm new to Scala and struggling with the syntax...
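In plain Scala (independent of Spark), one way to do this - essentially what the accepted solution above ends up using - is to split on "/" and take the last element:

val fullPath = "/my/path/to/my/file.txt"
val fileName = fullPath.split("/").last   // "file.txt"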
04-10-2018
03:21 PM
I'm using Scala to read data from S3 and then perform some analysis on it. Suppose that in /path/to/my/data there are 4 "chunks": a.parquet, b.parquet, c.parquet, and d.parquet. In my results, I want one of the columns to show which chunk the data came from. Is that possible, and if so, how?

val df = spark.read.parquet("s3://path/to/my/data")
val frame = spark.sql(s""" SELECT some things... """);
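A sketch of one way to carry the source file through to the SQL step, using Spark's built-in input_file_name() - the view name my_data and column source_file are illustrative, not from the original post:

import org.apache.spark.sql.functions.input_file_name

// Tag every row with the path of the parquet part-file it was read from
val dfWithSource = spark.read.parquet("s3://path/to/my/data")
  .withColumn("source_file", input_file_name())

dfWithSource.createOrReplaceTempView("my_data")
val frame = spark.sql("SELECT source_file, count(*) FROM my_data GROUP BY source_file")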
Labels:
- Apache Spark
03-08-2018
10:44 AM
I wrestled with the Java and hdfs3 options mentioned above, but getting either of them to run on EMR was pretty painful and not very bootstrap-script friendly. Finally, I figured out how to do this with Spark. It's lightning fast, and super simple:

import java.util.UUID.randomUUID
import org.apache.hadoop.fs.{FileSystem, Path}

val hdfs = FileSystem.get(sc.hadoopConfiguration)
val files = hdfs.listStatus(new Path(args(0)))
val originalPath = files.map(_.getPath())
println("Will now list files in " + args(0) + "...")
for (i <- originalPath.indices) {
  val id = randomUUID().toString
  println("Will move " + originalPath(i) + " to " + id)
  hdfs.rename(originalPath(i), originalPath(i).suffix("." + id))
}
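Since the snippet references sc and args(0), it presumably runs as a compiled Spark application rather than in the shell. A minimal wrapper showing where those come from - the object name and submission details are hypothetical, not from the original post:

import java.util.UUID.randomUUID
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object RenameHdfsFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RenameHdfsFiles").getOrCreate()
    val sc = spark.sparkContext

    // Append a random UUID suffix to every file directly under args(0)
    val hdfs = FileSystem.get(sc.hadoopConfiguration)
    for (status <- hdfs.listStatus(new Path(args(0)))) {
      val id = randomUUID().toString
      hdfs.rename(status.getPath, status.getPath.suffix("." + id))
    }

    spark.stop()
  }
}

It could then be submitted with something like spark-submit --class RenameHdfsFiles rename.jar /path/on/hdfs (jar name hypothetical).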
03-01-2018
08:42 PM
Thanks @Matt Foley. The insight that renaming via the Java API is so much faster is especially interesting. I'll investigate that further!
03-01-2018
01:28 PM
#!/bin/bash
# Renames every file under each partition of an HDFS path to a random UUID.
# The actual "hdfs dfs -mv" is left commented out as a dry run.

scriptname=`basename "$0"`

echo ""
echo "Running $scriptname $@..."
echo "  (Usage: $scriptname <path_to_data>)"
echo ""

if [ "$#" -ne 1 ]
then
    echo "Wrong number of arguments. Expected 1 but got $#"
    exit 1
fi

SECONDS=0
HDFS_PATH="$1"

# "hdfs dfs -ls" output is split on whitespace, so keep only the tokens
# that are actual paths under the requested directory.
for partition_name in `hdfs dfs -ls $HDFS_PATH`
do
    if [[ $partition_name == $HDFS_PATH* ]]
    then
        echo "Looping through $partition_name"
        for chunk_name in `hdfs dfs -ls $partition_name`
        do
            if [[ $chunk_name == $partition_name* ]]
            then
                UUID=$(uuidgen)
                UUID=${UUID^^}
                echo "Will rename $chunk_name to $partition_name/$UUID"
                # hdfs dfs -mv $chunk_name "$partition_name/$UUID"
            fi
        done
    fi
done

duration=$SECONDS
echo "Exiting after $duration seconds"
exit 0
03-01-2018
01:27 PM
Thanks @Matt Foley. A few clarifications for whatever they're worth:

"You don't say what application you're using to pull and clean the new data on HDFS." - I pull the data using s3-dist-cp, then project a table over it with Hive, and run Spark and Hive queries for ETL. So it's not an HDFS thing - I suppose it's technically a Hive thing. Still, I have the same question: can I control how Hive names the chunks?

"Am I correct in assuming that is using S3 commands to rename the file after uploading to S3? If so, you could try renaming the files on HDFS before uploading to S3." - No, I'm actually renaming on HDFS before pushing to S3. And yes, "hdfs dfs -mv ..." does take about 3 seconds per file. I can prove it if you're interested; I'll attach my script for reference.

Regarding your last comment - I do understand how S3 works. What I do NOT know of is an s3-dist-cp option to force a prefix or naming convention on the individual chunks. For example, if I have a bunch of data representing a table at /data/my/table in HDFS, I can push that to any prefix in S3, but I don't know how to specify that each chunk under /data/my/table should be renamed. I can push the chunks INDIVIDUALLY and control the names, but then my app is no longer scalable - the length of time increases linearly with the size of the data, regardless of the size of the cluster. That's why I'm trying to leverage s3-dist-cp; it's the only way I have found to push data from a cluster to S3 in a scalable way.
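For context, the s3-dist-cp invocation pattern being described is roughly the following - bucket and paths are made up for illustration, and only the standard --src/--dest options are shown; nothing in this form renames the individual chunks:

# Copy a whole HDFS table directory to S3 in one distributed job
s3-dist-cp --src hdfs:///data/my/table --dest s3://my-bucket/data/my/table/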