Member since: 09-21-2015
Posts: 133
Kudos Received: 130
Solutions: 24
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 348 | 12-17-2016 09:21 PM
 | 320 | 11-01-2016 02:28 PM
 | 68 | 09-23-2016 09:50 PM
 | 116 | 09-21-2016 03:08 AM
 | 85 | 09-19-2016 06:41 PM
01-04-2017
02:05 PM
1 Kudo
Hi @Sankaraiah Narayanasamy, for full Phoenix SQL syntax, you can use Spark's native JDBC features.
... View more
12-23-2016
07:01 AM
1 Kudo
Hi @Boris Demerov, please also see @Wes Floyd's Storm & Kafka guide here. There's some overlap between it and @Constantin Stanca's recommendations, but you may find it useful anyway.
... View more
12-21-2016
10:26 PM
1 Kudo
I need to route specific "streams" of flowfiles to specific cluster nodes by FlowFile attribute value. Is there a way to take an attribute and ensure that all incoming FlowFiles with the same value for an attribute get distributed to the same cluster node?
... View more
12-18-2016
08:17 PM
3 Kudos
Hi, @rudra prasad biswas, "select * from myTable" in Phoenix will only return columns defined in the table DDL. To include dynamic columns in query results, you must explicitly include them in the select clause. See the docs for an example. This is where views come in handy. Instead of manually defining dynamic columns in each query, you can define them once in a view, then select * from myView.
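A minimal sketch of both approaches, reusing the myTable/myView names above; COL1 and EXTRA_COL are hypothetical columns:
-- Declare the dynamic column inline in each query that needs it...
SELECT COL1, EXTRA_COL FROM myTable (EXTRA_COL VARCHAR);
-- ...or declare it once in a view, after which select * exposes it
CREATE VIEW myView (EXTRA_COL VARCHAR) AS SELECT * FROM myTable;
SELECT * FROM myView;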
... View more
12-18-2016
07:49 PM
Hi @rudra prasad biswas, for this use case, consider a few different approaches:
1. Use a single array-type "stoppage" column which stores each stop as a JSON element (see the table sketch below). To query for records where the last stoppage was New York, use something like "where stoppage[array_length(stoppage)-1] = 'New York'".
2. Continue using dynamic columns, but include an additional column "stoppage_count" which you increment for each additional stoppage. You can use stoppage_count to tell your application which dynamic columns to include in the query.
3. Use a relational model where you have records of trips, and another table with "stoppages" linked to the trip ID.
However, the above approaches assume your queries have an initial access pattern limiting the size of the scan. Assuming your table has a primary key of (customer_id, trip_id), you DON'T want to run a query like:
select * from trips where stoppages[array_length(stoppages)-1] = 'New York'
because it would be a full scan. Instead, you want something like:
select * from (
  select * from trips where customer_id = '1234'
) a
where stoppages[array_length(stoppages)-1] = 'New York'
which would be a much smaller range scan.
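A sketch of the table shape for option 1 (names are illustrative only, using Phoenix's ARRAY type and a composite primary key):
CREATE TABLE trips (
    customer_id VARCHAR NOT NULL,
    trip_id VARCHAR NOT NULL,
    stoppages VARCHAR ARRAY,
    CONSTRAINT pk PRIMARY KEY (customer_id, trip_id)
);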
... View more
12-17-2016
09:21 PM
3 Kudos
Hi @rudra prasad biswas, column families are used to optimize RegionServer memory by separating frequently and infrequently used columns. For example, when using HBase for document storage, document metadata is queried 5-10x for every one read of document content. In this type of application, you would store document content in one column family and document metadata in another, allowing HBase to avoid loading document content into memory until it is explicitly accessed. As @Ryan Cicak hinted at, Phoenix tables assign columns to the first column family by default. However, you can specify which family to assign columns to:
CREATE TABLE TEST (MYKEY VARCHAR NOT NULL PRIMARY KEY, A.COL1 VARCHAR, A.COL2 VARCHAR, B.COL3 VARCHAR)
In this example, COL3 is stored in column family B, and the other columns belong to family A. Assuming there is a subset of column qualifiers you know you will use, they should be defined in the table DDL; however, this is not required. See the Phoenix docs on dynamic columns and views for how to write arbitrary column qualifiers to a table without pre-defining those columns. Views let you keep track of and expose dynamic columns when needed.
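As a quick illustration (hypothetical values) of writing and reading a column qualifier that is not in the DDL above:
-- EXTRA_COL is not declared in the table; give it a type inline
UPSERT INTO TEST (MYKEY, COL1, EXTRA_COL VARCHAR) VALUES ('row1', 'a-value', 'dynamic-value');
-- Dynamic columns must be declared again at read time
SELECT MYKEY, COL1, EXTRA_COL FROM TEST (EXTRA_COL VARCHAR) WHERE MYKEY = 'row1';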
... View more
12-12-2016
06:51 AM
1 Kudo
I have a very high incoming event rate and a parser written in Python. I'm able to run it with ExecuteStreamCommand, but the latency of starting/stopping a process for every incoming FlowFile is too high and I can't keep up with my source. Is it possible for ExecuteStreamCommand to keep the external process alive and pass new FlowFiles to it after some signal?
... View more
11-01-2016
02:28 PM
2 Kudos
Hi @Zack Riesland, the Hive shell behaves like any other bash program: you can check the status of the previous command with $? -- unsuccessful queries/jobs return a non-zero exit code. For example:
[root@dn0 /]# hive -e "show databas;"
Logging initialized using configuration in file:/etc/hive/2.5.0.0-1245/0/hive-log4j.properties
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/user/root":hdfs:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:292)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:213)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1827)
....
[root@dn0 /]# echo $?
1
... View more
10-13-2016
08:03 PM
Correct me if you know otherwise, but SQuirreL isn't a SQL engine itself. It needs a backend to connect to before it can do much of anything. When doing data exploration outside of a Hadoop environment, I've had success using standalone Zeppelin/Spark as a way to run SQL against static files (including files stored in S3).
... View more
10-13-2016
05:35 PM
S3a would be better, but depending on your cluster version, s3n may just work, whereas S3a still has kinks yet to be worked out.
... View more
10-13-2016
05:33 PM
@Zack Riesland, something like the following works pretty well:
hive -e "create table my_export stored as textfile location 's3n://my_bucket/my_export' as select * from my_table;"
This will be a parallel write into S3, as if it were just another directory on HDFS. You will need to use the AWS APIs to configure the needed security policies on the bucket and/or "subfolders".
... View more
10-11-2016
08:24 PM
@Sunile Manjee, Depending on what data your Spark application attempts to access, it uses the relevant JVM HBase client APIs: filters, scans, gets, range-gets, etc. See the code here.
... View more
10-07-2016
04:18 PM
Worth noting: the above works for 's3n://' URLs, but not 's3a://' or plain 's3://'.
... View more
10-07-2016
03:20 PM
I am trying to create a table in S3 using HiveQL. I have added the following access key configs to HDFS core-site.xml and hive-site.xml:
fs.s3.awsAccessKeyId, fs.s3n.awsAccessKeyId, fs.s3a.awsAccessKeyId
fs.s3.awsSecretAccessKey, fs.s3n.awsSecretAccessKey, fs.s3a.awsSecretAccessKey
I have added what I believe are the relevant AWS jars to Hive's classpath:
hive> add jar /usr/hdp/current/hadoop-client/hadoop-aws.jar;
hive> add jar /usr/hdp/current/hadoop-client/lib/aws-java-sdk-s3-1.10.6.jar;
hive> add jar /usr/hdp/current/hadoop-client/lib/aws-java-sdk-core-1.10.6.jar;
Unfortunately, creating a table stored in s3a fails:
hive> create table test_backup_a stored as orc location 's3a://hwx-randy/test_backup_a' as select * from test;
...
Moving data to directory s3a://hwx-randy/test_backup_a
Failed with exception Unable to move source hdfs://dn0.dev:8020/apps/hive/warehouse/.hive-staging_hive_2016-10-07_15-14-34_668_7427003017223534386-1/-ext-10001 to destination s3a://hwx-randy/test_backup_a
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
... View more
10-05-2016
08:11 PM
1 Kudo
I've seen the nifi.variable.registry.properties config, but those values seem to be loaded statically and only at NiFi start. I have a use case where connection strings and other environment settings need to be read periodically from a database and used to control, for example, the hostname and directory for GetSFTP. How can I achieve this?
... View more
10-05-2016
05:33 PM
1 Kudo
Thank you, @Michael Young!
... View more
09-27-2016
05:22 PM
He filters all cells by timestamp here. If I understand correctly, without the filter, the DataFrame would expose all cell versions that exist in the snapshot.
... View more
09-27-2016
05:07 PM
2 Kudos
@Artem Ervits See @Dan Zaratsian's examples reading cell versions and timestamps from a snapshot here.
... View more
09-23-2016
09:50 PM
1 Kudo
@Sunile Manjee I don't have stats, but you need to use the Phoenix bulk load tool regardless, as a plain HBase bulk load will not ensure consistent secondary indices, nor will it use the sign and byte-ordering conventions that Phoenix expects.
... View more
09-23-2016
08:34 PM
Livy is not "hidden". If you have started the Livy server, you can interact with its REST API from any application.
... View more
09-22-2016
07:33 PM
*Edit*
Realistically, questions about shared SparkContexts are often about one of two things:
1. Making shared use of cached DataFrames/Datasets. Livy and the Spark Thrift JDBC/ODBC server are decent initial solutions; keep an eye on Spark-LLAP integration, which will be better all around (security, efficiency, etc.).
2. Spark applications consuming all of a cluster's resources. Spark's ability to spin executor instances up and down dynamically based on utilization is probably a better solution to this problem than sharing a single SparkContext.
... View more
09-22-2016
06:27 PM
1 Kudo
@Sunile Manjee Consider how Spark applications run: a driver runs either on the client or in a YARN container. If multiple users will ask the same Spark application instance to do multiple things, they need an interface for communicating that to the driver. Livy is the out-of-the-box REST interface that shares a single Spark application by presenting a control interface to external users. If you do not want to use Livy but still want to share a Spark context, you need to build an external means of communicating with the shared driver. One solution might be to have the driver periodically pull new queries from a database or from files on disk. This functionality is not built into Spark, but could be implemented with a while loop and a sleep statement.
... View more
09-21-2016
03:08 AM
3 Kudos
@hduraiswamy - in order of preference:
1. SyncSort
2. Use the mainframe's native JDBC services – often unacceptable, as the mainframe must consume additional MIPS to convert into JDBC types before sending over the net
3. Use this open serde, which unfortunately skips reading everything except fixed-length fields, severely limiting its usefulness
4. I've heard about LegStar being used for similar projects, but am not sure how.
... View more
09-19-2016
06:41 PM
2 Kudos
@Andrew Watson You can set cell-level ACLs via the HBase shell or via HBase's Java API. This type of policy is not exposed or controlled via Ranger. If possible, I would implement row-level policies in a client-side application, as HBase's cell ACLs are expensive (additional metadata must be stored and read with every cell). My favorite solution is to create a Phoenix view that exposes only specific rows. As noted above, your client-side app would have to decide whether to allow access to a given view.
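For example, a row-filtering view might look like this (table and column names are hypothetical), with the client-side app deciding which users may query it:
CREATE VIEW ORDERS_EAST AS SELECT * FROM ORDERS WHERE REGION = 'EAST';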
... View more
09-14-2016
06:29 PM
3 Kudos
@Carlos Barichello Livy isn't hidden. If you've started Livy, you can use its REST API to launch Spark jobs from Zeppelin or from elsewhere.
... View more
09-14-2016
05:24 PM
1 Kudo
Hi @Mukesh Kumar. Storing the metadata in HBase is a great design. Whether the content itself should go in HBase or directly on HDFS depends on content size. HBase now has medium object (MOB) support, which means content up to a few MB is fine, particularly if you store the metadata and the actual content in separate column families. On the UI front, if you have files stored in HDFS, you can use string concatenation to embed the filename in a WebHDFS URL: <a href="http://<HOST>:<PORT>/webhdfs/v1/user/dev/images/img1.gif?op=OPEN">Link</a>, which will download as a file when clicked. Note, I've done this in Zeppelin, but haven't tried it in the Hive View or in Hue. If you're accessing content from HBase, you'll need a service to front HTTP calls. The Phoenix Query Server may make this possible out of the box, but I haven't tried.
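A sketch of that layout in Phoenix DDL (all names are illustrative), keeping metadata and content in separate column families so metadata queries never load the content bytes:
CREATE TABLE DOCS (
    DOC_ID VARCHAR NOT NULL PRIMARY KEY,
    META.FILENAME VARCHAR,
    META.CREATED_AT TIMESTAMP,
    CONTENT.BODY VARBINARY
);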
... View more
09-12-2016
10:07 PM
2 Kudos
Hi @gkeys, following the Redshift example here, the properties in your JDBC interpreter settings would instead be phoenix.user, phoenix.driver, and phoenix.url. Since you have modified the "default" prefix, does plain "%jdbc" not work for the interpreter setting?
... View more
09-08-2016
07:49 PM
@Jasper I see your point. For a further (external) example of using the Spark Data Source API to pass a predicate, see this comment. As a predicate, can you try "where keyCol1 > X and keyCol2 < Y"?
What does the explain plan say if you issue that filter in a normal JDBC query?
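For reference, a quick way to check that from sqlline (placeholder table and values):
EXPLAIN SELECT * FROM MY_TABLE WHERE KEYCOL1 > 100 AND KEYCOL2 < 50;
-- a predicate on the leading primary key columns should show a RANGE SCAN rather than a FULL SCAN in the plan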
... View more