Member since
02-04-2016
189
Posts
70
Kudos Received
9
Solutions
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 1974 | 07-12-2018 01:58 PM |
| | 4467 | 03-08-2018 10:44 AM |
| | 1453 | 06-24-2017 11:18 AM |
| | 16528 | 02-10-2017 04:54 PM |
| | 1272 | 01-19-2017 01:41 PM |
05-28-2019
06:08 PM
It took me a while to look in /var/log/messages, but I found a ton of ntpd errors. It turns out that our nodes were having trouble reaching the external servers they were configured to use for time sync. I switched all the configurations to use a local on-premises NTP server and restarted everything. I'm hoping that will be the full solution to our issue.
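For reference, the change was roughly this on each node (the hostname below is a placeholder for our internal NTP server, not the real name):
# /etc/ntp.conf - replace the external pool/server entries with the local server
server ntp.internal.example.com iburst
# then restart and verify:
systemctl restart ntpd
ntpq -p    # the local server should eventually get a '*' marking it as the selected peer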
... View more
05-26-2019
11:52 AM
Thanks @Geoffrey Shelton Okot. Just to clarify, we corrected all the hosts files and restarted all the services. I have a hunch that there is some HBase data somewhere that is now corrupt because it is associated with the incorrect FQDN. But I wouldn't expect Hive to have any relationship to HBase. Does ZooKeeper use HBase for record keeping?
... View more
05-25-2019
12:03 PM
Hello, We've recently been seeing some weird behavior from our cluster. Things will work well for a day or two, and then Hive server and several region servers will go offline. When I dig into the logs, they all reference ZooKeeper:
2019-05-24 20:12:15,108 ERROR nodes.PersistentEphemeralNode (PersistentEphemeralNode.java:deleteNode(323)) - Deleting node: /hiveserver2/serverUri=<servername>:10010;version=1.2.1000.2.6.1.0-129;sequence=0000000187
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hiveserver2/serverUri=<servername>:10010;version=1.2.1000.2.6.1.0-129;sequence=0000000187
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:239)
at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:234)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:230)
at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:215)
at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:42)
at org.apache.curator.framework.recipes.nodes.PersistentEphemeralNode.deleteNode(PersistentEphemeralNode.java:315)
at org.apache.curator.framework.recipes.nodes.PersistentEphemeralNode.close(PersistentEphemeralNode.java:274)
at org.apache.hive.service.server.HiveServer2$DeRegisterWatcher.process(HiveServer2.java:334)
at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:61)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
2019-05-24 20:12:15,110 ERROR server.HiveServer2 (HiveServer2.java:process(338)) - Failed to close the persistent ephemeral znode
However, when I look in the ZooKeeper logs, I don't see anything. If I restart the failed services, they will run for several hours, and then the process repeats.
We haven't changed any settings on the cluster, BUT two things have changed recently:
1 - A couple weeks ago, some IT guys made a mistake and accidentally changed the /etc/hosts files. We fixed this and restarted everything on the cluster.
2 - Those changes in (1) were part of some major network changes, and we seem to have a lot more latency.
With all of that said, I really need some help figuring this out. Could it be stale HBase WAL files somewhere? Could that cause Hive server to fail? Is there a ZooKeeper timeout setting I can change to help? Any tips would be much appreciated.
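For completeness, this is the kind of quick check I can run from the affected nodes (zk-host is a placeholder for one of our ZooKeeper servers):
ntpq -p                          # is the node actually syncing time? look for a '*' peer
echo ruok | nc zk-host 2181      # ZooKeeper four-letter-word check; should answer 'imok'
echo stat | nc zk-host 2181      # shows client connections and negotiated session timeouts
ping -c 5 zk-host                # rough look at latency to the quorum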
... View more
Labels: Apache HBase, Apache Hive
05-10-2019
06:50 PM
Here's a weird scenario I'm trying to understand: We have an on-premises cluster running HDP 2.6.x, and we haven't changed any of the settings or hardware in a long time. Suddenly, when we try to open the "hive" CLI from a data node or an edge node, it regularly fails. No error - it just hangs. This happens when there is NOTHING else going on on the cluster: no queries, no full queues, no running applications, and HDFS is only about 70% full. When I look at the application manager, it creates an application that sits in "accepted". If I drill into it, I see this:
"Application is Activated, waiting for resources to be assigned for AM. Last Node which was processed for the application : server.name:45454 ( Partition : [], Total resource : <memory:193024, vCores:16>, Available resource : <memory:193024, vCores:16> ). Details : AM Partition = <DEFAULT_PARTITION> ; Partition Resource = <memory:2702336, vCores:224> ; Queue's Absolute capacity = 100.0 % ; Queue's Absolute used capacity = 4.187192 % ; Queue's Absolute max capacity = 100.0 % ; "
This just started this week, and it's really inconsistent. We think there must be some kind of weird networking issue going on behind the scenes - we're at the mercy of IT to know what might have changed there. But I would really appreciate some help troubleshooting.
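For reference, a minimal set of checks from an edge node looks like this (standard YARN CLI; 'default' is just a placeholder queue name):
yarn node -list -all                          # are all NodeManagers RUNNING, or are some LOST/UNHEALTHY?
yarn application -list -appStates ACCEPTED    # how many apps are stuck waiting for an AM container
yarn queue -status default                    # capacity / used capacity for the queue the CLI submits to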
... View more
10-11-2018
05:50 PM
I'm following the instructions here, but I still can't seem to get it working. I'm pointing to the same driver with the same driver class name. However, my URL is a bit different. I'm trying to connect to an AWS Aurora instance. If I use a connection string like this:
jdbc:mysql://my-url.us-east-1.rds.amazonaws.com:1433;databaseName=my_db_name
I get the error below. If I remove the port and db name, I get an error that says "Unable to execute SQL ... due to org.apache.commons.dbcp.SQLNestedException: Cannot create PoolableConnectionFactory (Communications link failure". Any ideas?
ExecuteSQL[id=df4d1531-3056-1f5a-9d32-fa30462c23ba] Unable to execute SQL select query <query> for StandardFlowFileRecord[uuid=7d70ed35-ae97-47e8-a860-0a2fa75fa2ef,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1538684794331-5, container=default, section=5], offset=939153, length=967],offset=0,name=properties.json,size=967] due to org.apache.commons.dbcp.SQLNestedException: Cannot create PoolableConnectionFactory (Cannot load connection class because of underlying exception: 'java.lang.NumberFormatException: For input string: "1433;databaseName=my_db_name"'.); routing to failure: org.apache.nifi.processor.exception.ProcessException: org.apache.commons.dbcp.SQLNestedException...
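Looking at the NumberFormatException again, the driver seems to be treating everything after the colon as the port number, which makes me think the ";databaseName=" part is SQL Server syntax rather than MySQL. The MySQL/Aurora-style form would look roughly like this (port and names are placeholders):
jdbc:mysql://my-url.us-east-1.rds.amazonaws.com:3306/my_db_name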
... View more
07-12-2018
01:58 PM
I was able to get this to work by using the insertInto() function, rather than the saveAsTable() function.
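Roughly, the working version looked like this (a sketch against the same variables as in my original question):
// sketch: insertInto() uses the table's existing partitioning, resolves columns by
// position (partition column last in the select), and can't be combined with partitionBy()
incrementalKeyed
  .write
  .mode("append")
  .insertInto(outputDBName + "." + outputTableName + "_keyed")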
... View more
07-12-2018
10:29 AM
Thanks @hmatta. Printing schema for sqlDFProdDedup:
root
|-- time_of_event_day: date (nullable = true)
|-- endpoint_id: integer (nullable = true)
...
|-- time_of_event: integer (nullable = true)
...
|-- source_file_name: string (nullable = true)
Printing schema for deviceData:
root
...
|-- endpoint_id: integer (nullable = true)
|-- source_file_name: string (nullable = true)
...
|-- start_dt_unix: long (nullable = true)
|-- end_dt_unix: long (nullable = true)
Printing schema for incrementalKeyed (result of joining 2 sets above):
root
|-- source_file_name: string (nullable = true)
|-- ingest_timestamp: timestamp (nullable = false)
...
|-- endpoint_id: integer (nullable = true)
...
|-- time_of_event: integer (nullable = true)
...
|-- time_of_event_day: date (nullable = true)
... View more
07-11-2018
06:38 PM
I have a hive table (in the glue metastore in AWS) like this:
CREATE EXTERNAL TABLE `events_keyed`(
`source_file_name` string,
`ingest_timestamp` timestamp,
...
`time_of_event` int
...)
PARTITIONED BY (
`time_of_event_day` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'my_location'
TBLPROPERTIES (
'PARQUET.COMPRESSION'='SNAPPY',
'transient_lastDdlTime'='1531187782')
I want to append data to it from spark:
val deviceData = hiveContext.table(deviceDataDBName + "." + deviceDataTableName)
val incrementalKeyed = sqlDFProdDedup.join(broadcast(deviceData),
$"prod_clean.endpoint_id" === $"$deviceDataTableName.endpoint_id"
&& $"prod_clean.time_of_event" >= $"$deviceDataTableName.start_dt_unix"
&& $"prod_clean.time_of_event" <= coalesce($"$deviceDataTableName.end_dt_unix"),
"inner")
.select(
$"prod_clean.source_file_name",
$"prod_clean.ingest_timestamp",
...
$"prod_clean.time_of_event",
...
$"prod_clean.time_of_event_day"
)
// this shows good data:
incrementalKeyed.show(20, false)
incrementalKeyed.repartition($"time_of_event_day")
.write
.partitionBy("time_of_event_day")
.format("hive")
.mode("append")
.saveAsTable(outputDBName + "." + outputTableName + "_keyed")
But this gives me a failure:
Exception encountered reading prod data:
org.apache.spark.SparkException: Requested partitioning does not match the events_keyed table:
Requested partitions:
Table partitions: time_of_event_day
What am I doing wrong? How can I accomplish the append operation I'm trying to get?
... View more
Labels: Apache Hive, Apache Spark
06-14-2018
10:46 AM
I'm sorry, @shaleen somani - this was over a year ago and I don't remember the details any more. My guess is that our primary and secondary name nodes had failed over for some reason. I've found that when this happens, things continue to "work", but not quite right and it can be hard to pin down. You can use the hdfs haadmin utility to check the status. Good luck!
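For reference, checking the HA state looks like this (nn1/nn2 are placeholders for the NameNode IDs defined in hdfs-site.xml):
hdfs haadmin -getServiceState nn1    # prints "active" or "standby"
hdfs haadmin -getServiceState nn2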
... View more
06-11-2018
04:49 PM
I'm having some trouble optimizing a query and hoping someone can see something I'm missing. Basically, I have a series of statements that follow the pattern below. I create a table, then populate it using a UDF. Then I create a partitioned copy of the table and copy the data into it (TEZ won't allow me to combine the UDF step and the partitioning step).
Here's my big issue: On my big insert statement (INSERT INTO TABLE ${hiveconf:dbName}.${hiveconf:prod_table_name}_keyed) I see all the heavy lifting happen (242 mappers, then 1 mapper, then 1 reducer). But then the actual writing of the data to disk (the final 10 reducers for the 10 buckets) takes longer than everything else combined.
The reason I added the buckets is because the next step, where I partition the results, has a similar issue - a very slow, single reducer. So I was hoping that by forcing the data into 10 buckets, I could get 10 reducers and it would run faster. But regardless, I'm getting a horrible bottleneck at the end of the query that has to do with IO latency. Can anyone suggest a way to improve this? Thanks!
CREATE external TABLE ${hiveconf:dbName}.${hiveconf:prod_table_name}_keyed
(
source_file_name string,
ingest_timestamp timestamp,
...
20 other columns
...
)
CLUSTERED BY (sample_point) INTO 10 BUCKETS
stored as parquet
LOCATION '${hiveconf:incremental_data_path}_keyed_temp'
TBLPROPERTIES ('PARQUET.COMPRESSION'='SNAPPY');
CREATE TEMPORARY FUNCTION func_1 as 'com.do.stuff' USING JAR '/home/hadoop/my-jar';
CREATE TEMPORARY FUNCTION func_2 as 'com.do.other.stuff' USING JAR '/home/hadoop/may-jar';
INSERT INTO TABLE ${hiveconf:dbName}.${hiveconf:prod_table_name}_keyed
select
x.source_file_name,
x.ingest_timestamp,
...
other columns
...
from
(
select
s.source_file_name,
s.ingest_timestamp,
...
other columns
...
func_1(input_column) as unit_of_measure,
func_2(input_columns2) as something_else
from
(
select
er.source_file_name,
er.ingest_timestamp,
...
other columns
...
from ${hiveconf:dbName}.${hiveconf:prod_table_name} er
inner join ${hiveconf:db_name}.${hiveconf:table_name} edd
where <clause>
distribute by something
sort by something_else asc
) s
) x ;
CREATE external TABLE ${hiveconf:dbName}.${hiveconf:prod_table_name}_keyed_partitioned
(
source_file_name string,
ingest_timestamp timestamp,
...
other columns
...
)
PARTITIONED BY (sample_point_day date)
stored as parquet
LOCATION '${hiveconf:incremental_data_path}_keyed'
TBLPROPERTIES ('PARQUET.COMPRESSION'='SNAPPY');
insert into table ${hiveconf:dbName}.${hiveconf:prod_table_name}_keyed_partitioned
PARTITION (sample_point_day)
select
r.source_file_name,
r.ingest_timestamp,
r.other_columns
from ${hiveconf:dbName}.${hiveconf:prod_table_name}_keyed r;
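One note on that last insert: since no static partition value is given, it relies on Hive's dynamic partitioning being enabled. Those settings look like this (standard Hive properties; the reducer sizing value is only an illustration of how I've been trying to coax more reducers out of the final stage, not a recommendation):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.reducers.bytes.per.reducer=134217728;  -- roughly 128 MB per reducer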
... View more
Labels: Apache Hive, Apache Tez
06-07-2018
10:48 AM
Agreed! It certainly appears to be a bug.
... View more
06-06-2018
07:07 PM
Suppose I have some data in s3:
s3://my_bucket/my_path/to/my/data/myfile.txt
And suppose I use a ListS3 processor with the bucket and pass "my_path/to/my/data/" as the prefix. I will get TWO flow files: "s3://my_bucket/my_path/to/my/data/myfile.txt" and "s3://my_bucket/my_path/to/my/data/", even though the latter is just a partial key that doesn't represent an object. How can I tune my settings to only get the entry for "myfile.txt"? Thanks in advance!
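The workaround I'm considering, sketched below, is a RouteOnAttribute processor right after ListS3 that drops the folder-placeholder entries - this assumes the listed key ends up in the standard filename attribute:
# RouteOnAttribute dynamic property (sketch): route real objects to a 'files' relationship
files = ${filename:endsWith('/'):not()}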
... View more
Labels: Apache NiFi
05-24-2018
12:04 PM
Thanks Matt, My issue was firewall related. I'm all set now. Thanks for your help!
... View more
05-23-2018
08:27 PM
Thanks @Matt Clarke You must be back from the NSA days 🙂 Your message is helpful, but I'm still not able to access the UI from the browser on my laptop. Here's what I've got:
I have a RHEL 7.5 server running in EC2, in a VPC. It's running Nifi 1.6.0 using all vanilla settings. I can access the server using NoMachine and interact with Nifi in the browser directly on the machine. I added a Security Group rule to open port 8080. As you said, the logs list about 4 different URLs - they are all different IPs associated with the machine. But none of them work from my laptop (which is in the VPC via VPN).
I also tried setting the nifi.web.http.host value, and I also tried changing to a different port (restarting after each change). I even tried setting the Security Group to allow "all traffic" from "everywhere", so I don't think ports are the issue. (Interestingly, if I set the nifi.web.http.host value, I am no longer able to access Nifi in the browser on the host machine using 'localhost'.)
So... any other ideas? I'm feeling a little stuck...
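For reference, this is the nifi.properties change I'm experimenting with (a sketch - my understanding is that nifi.web.http.host binds only that one interface, which would explain why localhost stops working when it's set to a specific IP; 0.0.0.0 should bind all interfaces):
# nifi.properties (sketch)
nifi.web.http.host=0.0.0.0
nifi.web.http.port=8080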
... View more
05-23-2018
07:38 PM
Thanks @Matt Clarke My mistake on the response - I clicked "reply" but apparently managed to type in the wrong box... I have one followup question since you seem to know Nifi - Simply opening access to port 8080 on the Nifi server doesn't appear to be sufficient for making it accessible to other computers on the same network. I've been looking for some instructions, and everything I've found points to setting up HTTPS, certificates, keys, etc. (like this https://bryanbende.com/development/2016/08/17/apache-nifi-1-0-0-authorization-and-multi-tenancy) Is that the only option? For reference, this is running in a VPC and only machines with VPN access can see the server at all. Thanks!
... View more
05-23-2018
04:48 PM
Thanks @Matt Clarke So I'm thinking I'll open up the port so that different devs can access the flow through the browser (it's all protected by VPN) and utilize process groups to help isolate distinct pieces. Does that sound like a good plan?
... View more
05-23-2018
04:06 PM
I would like to introduce Nifi as a tool for controlling a top-level work flow, but I want it to be something that my whole team can access and maintain, and I'm wondering about best practices in this context.
For example, we currently have a single Nifi instance with a single flow on a shared server. So anyone on the team can RDP to the server and see/edit the flow at localhost:8080 - but only one person at a time.
But what if we want multiple flows and the ability for multiple devs to have access at the same time? At a high level, it looks like we could run multiple instances of Nifi and just have a record somewhere that localhost:8080 is prod and localhost:8090 is dev, or something like that. But that still doesn't allow admin A to work on prod and admin B to work on dev at the same time. They would have to make changes on separate machines and then deploy the XML.
Even if we opened up the ports so that Nifi is accessible through the browser on a remote machine, how does it work if 2 devs are editing at the same time? Is that ok as long as they are in separate process groups? I'm trying to understand the options and best practices for this scenario. Thanks!
... View more
Labels: Apache NiFi
04-20-2018
05:01 PM
Suppose you have some scala code with a "timestamp" variable:
val timestamp = 1234567
How can I pass that to a "where" or "filter" clause on a dataframe? I want something like this, but this doesn't work:
val dfCorruptData = dfClean.where("sample_point <= 1388534400 or sample_point >= $timestamp")
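A minimal sketch of the two forms I'm comparing (the plain string literal above never interpolates $timestamp, so it needs either the s prefix or the column API):
// string interpolation: note the s prefix
val dfCorruptData = dfClean.where(s"sample_point <= 1388534400 or sample_point >= $timestamp")
// or the equivalent with the column API
import org.apache.spark.sql.functions.col
val dfCorruptData2 = dfClean.where(col("sample_point") <= 1388534400 || col("sample_point") >= timestamp)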
... View more
Labels: Apache Spark
04-10-2018
08:31 PM
Here's what I ended up with (callUDF and input_file_name come from org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.{callUDF, input_file_name}
spark.udf.register("getOnlyFileName", (fullPath: String) => fullPath.split("/").last)
val df2 = df1.withColumn("source_file_name2", callUDF("getOnlyFileName", input_file_name()))
... View more
04-10-2018
05:19 PM
Thanks @Amol Thacker One quick followup: do you know what the syntax would be to strip the path from the file name? So, convert /my/path/to/my/file.txt to file.txt. I'm new to Scala and struggling w/ syntax...
... View more
04-10-2018
03:21 PM
I'm using Scala to read data from S3, and then perform some analysis on it. Suppose that in /path/to/my/data, there are 4 "chunks": a.parquet, b.parquet, c.parquet, and d.parquet. In my results, I want one of the columns to show which chunk the data came from. Is that possible, and if so, how?
val df = spark.read.parquet("s3://path/to/my/data")
val frame = spark.sql( s""" SELECT some things... """);
... View more
Labels: Apache Spark
03-08-2018
10:44 AM
I wrestled with the Java and hdfs3 options mentioned above, but getting either of them to run on EMR was pretty painful and not very bootstrap-script friendly. Finally, I figured out how to do this with Spark. It's lightning fast, and super simple:
import org.apache.hadoop.fs.{FileSystem, Path}
import java.util.UUID.randomUUID
val hdfs = FileSystem.get(sc.hadoopConfiguration)
val files = hdfs.listStatus(new Path(args(0)))
val originalPath = files.map(_.getPath())
println("Will now list files in " + args(0) + "...")
for(i <- originalPath.indices)
{
val id = randomUUID().toString;
println("Will move " + originalPath(i) + " to " + id);
hdfs.rename(originalPath(i), originalPath(i).suffix("." + id))
}
... View more
03-01-2018
08:42 PM
Thanks @Matt Foley, The insight that renaming via the Java API is so much faster is especially interesting. I'll investigate that further!
... View more
03-01-2018
01:28 PM
#!/bin/bash
scriptname=`basename "$0"`
echo ""
echo "Running $scriptname $@..."
echo " (Usage: $scriptname <path_to_data>)"
echo ""
if [ "$#" -ne 1 ]
then
echo "Wrong number of arguments. Expected 1 but got $#"
exit 1;
fi
SECONDS=0
HDFS_PATH="$1"
for partition_name in `hdfs dfs -ls $HDFS_PATH`
do
if [[ $partition_name == $HDFS_PATH* ]]
then
echo "Looping through $partition_name"
for chunk_name in `hdfs dfs -ls $partition_name`
do
if [[ $chunk_name == $partition_name* ]]
then
UUID=$(uuidgen)
UUID=${UUID^^}
echo "Will rename $chunk_name to $partition_name/$UUID"
# hdfs dfs -mv $chunk_name "$partition_name/$UUID"
fi
done
fi
done
duration=$SECONDS
echo "Exiting after $duration"
exit 0;
... View more
03-01-2018
01:27 PM
Thanks @Matt Foley A few clarifications for whatever they're worth:
"You don't say what application you're using to pull and clean the new data on HDFS." - I pull the data using s3-dist-cp, and then project a table over it with Hive, and run Spark and Hive queries for ETL. It's not an HDFS thing - I suppose it's technically a Hive thing. Still, I have the same question - can I control how Hive names the chunks?
"Am I correct in assuming that is using S3 commands to rename the file after uploading to S3? If so, you could try renaming the files on HDFS before uploading to S3." - No. I'm actually renaming on HDFS before pushing to S3. And yes, "hdfs dfs -mv ..." does take about 3 seconds per file. I can prove it if you're interested. I'll attach my script for reference.
Regarding your last comment - I do understand how S3 works. I do NOT know of an s3-dist-cp option to force a prefix or naming convention on the individual chunks. For example, if I have a bunch of data representing a table at /data/my/table in HDFS, I can push that to any prefix in S3, but I don't know how to specify that each chunk under /data/my/table should be renamed. I can push the chunks INDIVIDUALLY and control the name, but then my app is no longer scalable - the length of time increases linearly with the size of the data, regardless of the size of the cluster. That's why I'm trying to leverage s3-dist-cp - it's the only way I have found to push data from a cluster to S3 in a scalable way.
... View more
02-27-2018
09:21 PM
Here's my scenario: I have an S3 bucket full of partitioned production data:
data_day=01-01-2017/000000_0
data_day=01-01-2017/000000_1
data_day=01-02-2017/000000_0
data_day=01-02-2017/000000_1
... etc
I spin up an EMR cluster, pull down some dirty data, and clean it up, including de-duplicating it against the prod data. Now, on my cluster, in HDFS, I have maybe:
data_day=01-01-2017/000000_0
data_day=01-02-2017/000000_0
This represents new data. I know that I can create a table, point the 'location' at the bucket described above, and do an "insert into" or an "insert overwrite", but this is very slow - it will use one reducer that will copy ALL the new data. Instead, I want to use s3-dist-cp, which will update the data much more quickly. However, my 000000_0 chunks will overwrite the old ones. I have a script that renames the chunks (000000_0 -> BCF704E2-B8A7-4F71-8747-A68AD52E50B7), but it takes about 3 seconds per partition, which is over an hour.
So, here's my question: is there an HDFS setting to change the way the chunks are named? For example, can I force the chunks to be named using the date or a GUID? Thanks in advance
... View more
Labels: Apache Hadoop
01-04-2018
11:54 AM
Hey everyone, I have a somewhat similar question, which I posted here: https://community.hortonworks.com/questions/155681/how-to-defragment-hdfs-data.html I would really appreciate any ideas. cc @Lester Martin @Jagatheesh Ramakrishnan @rbiswas
... View more
01-03-2018
07:59 PM
Suppose a scenario with a Hive table that is partitioned by day ("day=2017-12-12"). Suppose some process pushes data to the file store behind this table (new data under "day=2017-12-12" and "day=2017-12-13", etc). The "msck repair table" command updates the metastore to recognize all the new "chunks", and the data correctly shows up in queries. But suppose these chunks are mostly very small - is there a simple command to consolidate these? So instead of 100 small files under a partition, I get 2 well-sized ones, etc. I recognize that I can create a copy of the table and accomplish this, but that seems pretty clumsy. Is there some kind of hdfs command to "defrag" the data? FWIW, I'm using EMR with data in S3. Thanks in advance.
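For context, the closest thing I've found so far is rewriting each partition in place with Hive's small-file merge settings turned on - sketched below with placeholder table/column names. My understanding is that Hive stages the result before swapping the partition, so reading and overwriting the same partition should be safe, but I'd still love to hear about something less clumsy:
SET hive.merge.mapfiles=true;
SET hive.merge.tezfiles=true;
SET hive.merge.smallfiles.avgsize=134217728;   -- aim for roughly 128 MB files
INSERT OVERWRITE TABLE my_table PARTITION (day='2017-12-12')
SELECT col1, col2 FROM my_table WHERE day='2017-12-12';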
... View more
Labels: Apache Hive
11-11-2017
11:42 AM
I have a similar question. In my case, I need to connect to Hive using a SAS tool that only provides me with the following fields: Host(s), Port, Database. And then there is a tool to add "server side properties", which creates a list of key/value pairs. Can anyone tell me what server side properties I can use to force this connection to always use a specific queue? Or, a way to associate this connection with a user and associate that user with a key/value pair?
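In case it helps frame the question, these are the kinds of key/value pairs I'm guessing at - tez.queue.name and mapreduce.job.queuename are the standard queue properties, but I don't know whether the SAS tool actually passes them through, and my_queue is just a placeholder:
tez.queue.name=my_queue
mapreduce.job.queuename=my_queue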
... View more