Member since: 02-04-2016
Posts: 189
Kudos Received: 70
Solutions: 9

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3656 | 07-12-2018 01:58 PM |
| | 7670 | 03-08-2018 10:44 AM |
| | 3618 | 06-24-2017 11:18 AM |
| | 23041 | 02-10-2017 04:54 PM |
| | 2218 | 01-19-2017 01:41 PM |
05-23-2018
07:38 PM
Thanks @Matt Clarke. My mistake on the response - I clicked "reply" but apparently managed to type in the wrong box... I have one follow-up question since you seem to know NiFi: simply opening access to port 8080 on the NiFi server doesn't appear to be sufficient for making it accessible to other computers on the same network. I've been looking for instructions, and everything I've found points to setting up HTTPS, certificates, keys, etc. (like this: https://bryanbende.com/development/2016/08/17/apache-nifi-1-0-0-authorization-and-multi-tenancy). Is that the only option? For reference, this is running in a VPC and only machines with VPN access can see the server at all. Thanks!
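For anyone reading along, a minimal sketch of the conf/nifi.properties entries that control where the plain-HTTP UI listens; whether changing these alone is sufficient depends on the NiFi version, and the hostname below is just a placeholder:

# conf/nifi.properties (HTTP-only setup, no certificates)
# If the host is bound to 127.0.0.1, the UI is only reachable from the server itself;
# binding to the server's hostname/IP (placeholder below) exposes it to other machines on the VPN.
nifi.web.http.host=nifi-server.internal.example
nifi.web.http.port=8080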
05-23-2018
04:48 PM
Thanks @Matt Clarke. So I'm thinking I'll open up the port so that different devs can access the flow through the browser (it's all protected by VPN) and use process groups to help isolate distinct pieces. Does that sound like a good plan?
05-23-2018
04:06 PM
I would like to introduce NiFi as a tool for controlling a top-level workflow, but I want it to be something that my whole team can access and maintain, and I'm wondering about best practices in this context.

For example, we currently have a single NiFi instance with a single flow on a shared server. Anyone on the team can RDP to the server and see/edit the flow at localhost:8080 - but only one person at a time.

What if we want multiple flows and the ability for multiple devs to have access at the same time? At a high level, it looks like we could run multiple instances of NiFi and just keep a record somewhere that localhost:8080 is prod and localhost:8090 is dev, or something like that. But that still doesn't allow admin A to work on prod and admin B to work on dev at the same time - they would have to make changes on separate machines and then deploy the XML.

Even if we opened up the ports so that NiFi is accessible through the browser on a remote machine, how does it work if two devs are editing at the same time? Is that OK as long as they are in separate process groups?

I'm trying to understand the options and best practices for this scenario. Thanks!
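For reference, a rough sketch of what the "two instances on different ports" idea would look like - each NiFi install gets its own directory, and only the HTTP port differs in its conf/nifi.properties (ports taken from the example above):

# prod instance: <prod-install-dir>/conf/nifi.properties
nifi.web.http.port=8080

# dev instance: <dev-install-dir>/conf/nifi.properties
nifi.web.http.port=8090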
Labels:
- Apache NiFi
04-10-2018
08:31 PM
Here's what I ended up with:

// callUDF and input_file_name come from org.apache.spark.sql.functions
// input_file_name() returns the full path of the file each row was read from;
// the UDF keeps only the last path segment (e.g. /my/path/to/my/file.txt -> file.txt)
spark.udf.register("getOnlyFileName", (fullPath: String) => fullPath.split("/").last)
val df2 = df1.withColumn("source_file_name2", callUDF("getOnlyFileName", input_file_name()))
04-10-2018
05:19 PM
Thanks @Amol Thacker. One quick follow-up: do you know what the syntax would be to strip the path from the file name? So, convert /my/path/to/my/file.txt to file.txt. I'm new to Scala and struggling w/ syntax...
04-10-2018
03:21 PM
I'm using Scala to read data from S3, and then perform some analysis on it. Suppose that in /path/to/my/data, there are 4 "chunks": a.parquet, b.parquet, c.parquet, and d.parquet. In my results, I want one of the columns to show which chunk the data came from. Is that possible, and if so, how?

val df = spark.read.parquet("s3://path/to/my/data")
val frame = spark.sql(s""" SELECT some things... """);
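A minimal sketch of one way to get this with Spark's built-in input_file_name() function (the source_file column name is just illustrative); this is the same idea as the UDF-based solution posted above:

import org.apache.spark.sql.functions.input_file_name

// tag each row with the full path of the parquet file it was read from,
// e.g. s3://path/to/my/data/a.parquet
val df = spark.read.parquet("s3://path/to/my/data")
  .withColumn("source_file", input_file_name())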
Labels:
- Apache Spark
03-08-2018
10:44 AM
I wrestled with the Java and hdfs3 options mentioned above, but getting either of them to run on EMR was pretty painful and not very bootstrap-script friendly. Finally, I figured out how to do this with Spark. It's lightning fast, and super simple:

import java.util.UUID.randomUUID
import org.apache.hadoop.fs.{FileSystem, Path}

val hdfs = FileSystem.get(sc.hadoopConfiguration)
val files = hdfs.listStatus(new Path(args(0)))
val originalPath = files.map(_.getPath())
println("Will now list files in " + args(0) + "...")
for (i <- originalPath.indices) {
  val id = randomUUID().toString
  println("Will move " + originalPath(i) + " to " + id)
  // rename each file in place by appending a random UUID suffix
  hdfs.rename(originalPath(i), originalPath(i).suffix("." + id))
}
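Presumably this snippet lives inside a Spark application's main method, where args(0) is the target directory and sc is the application's SparkContext. A hypothetical invocation - the class and jar names here are made up for illustration:

spark-submit --class RenameChunks rename-chunks.jar hdfs:///data/my/table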
03-01-2018
08:42 PM
Thanks @Matt Foley. The insight that renaming via the Java API is so much faster is especially interesting. I'll investigate that further!
03-01-2018
01:28 PM
#!/bin/bash
scriptname=`basename "$0"`
echo ""
echo "Running $scriptname $@..."
echo " (Usage: $scriptname <path_to_data>)"
echo ""
if [ "$#" -ne 1 ]
then
echo "Wrong number of arguments. Expected 1 but got $#"
exit 1;
fi
SECONDS=0
HDFS_PATH="$1"
for partition_name in `hdfs dfs -ls $HDFS_PATH`
do
if [[ $partition_name == $HDFS_PATH* ]]
then
echo "Looping through $partition_name"
for chunk_name in `hdfs dfs -ls $partition_name`
do
if [[ $chunk_name == $partition_name* ]]
then
UUID=$(uuidgen)
UUID=${UUID^^}
echo "Will rename $chunk_name to $partition_name/$UUID"
# hdfs dfs -mv $chunk_name "$partition_name/$UUID"
fi
done
fi
done
duration=$SECONDS
echo "Exiting after $duration"
exit 0;
03-01-2018
01:27 PM
Thanks @Matt Foley. A few clarifications, for whatever they're worth:

"You don't say what application you're using to pull and clean the new data on HDFS." - I pull the data using s3-dist-cp, then project a table over it with Hive, and run Spark and Hive queries for ETL. So it's not an HDFS thing - I suppose it's technically a Hive thing. Still, I have the same question: can I control how Hive names the chunks?

"Am I correct in assuming that is using S3 commands to rename the file after uploading to S3? If so, you could try renaming the files on HDFS before uploading to S3." - No, I'm actually renaming on HDFS before pushing to S3. And yes, "hdfs dfs -mv ..." does take about 3 seconds per file. I can prove it if you're interested; I'll attach my script for reference.

Regarding your last comment - I do understand how S3 works. I do NOT know of an s3-dist-cp option to force a prefix or naming convention onto the individual chunks. For example, if I have a bunch of data representing a table at /data/my/table in HDFS, I can push that to any prefix in S3, but I don't know how to specify that each chunk under /data/my/table should be renamed. I can push the chunks INDIVIDUALLY and control the names, but then my app is no longer scalable: the length of time increases linearly with the size of the data, regardless of the size of the cluster. That's why I'm trying to leverage s3-dist-cp - it's the only way I have found to push data from a cluster to S3 in a scalable way.
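For context, a minimal sketch of the kind of s3-dist-cp call being described, using the HDFS path from the example (the bucket name is a placeholder): the whole directory is copied to S3 in parallel by the cluster, but this basic form has no flag that renames the individual chunk files along the way.

s3-dist-cp --src hdfs:///data/my/table --dest s3://my-bucket/data/my/table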
... View more