Member since: 07-05-2016
Posts: 42
Kudos Received: 32
Solutions: 6
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3514 | 02-13-2018 10:56 PM |
| | 1546 | 08-25-2017 05:25 AM |
| | 9344 | 03-01-2017 05:01 AM |
| | 4941 | 12-14-2016 07:00 AM |
| | 1167 | 12-13-2016 05:43 PM |
02-14-2018
12:12 AM
1 Kudo
You might start by using the `logger` command to send some sample syslog messages. Don't forget to add the `--port 1514` argument. Try running that on the Nifi host, and then on a host that's external to your Nifi cluster. If it works from a Nifi host but not from outside Nifi, you might need to tweak iptables or a firewall rule. You might try using tcpdump to monitor network traffic for port 1514. I'd also recommend running a `tail -f /var/log/nifi/nifi-app.log` on the Nifi host(s) while you're running the syslog listener to see if there are any interesting messages.
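For example, a quick test sequence might look like the sketch below. It assumes the listener is on UDP port 1514, `nifi01.example.com` is a stand-in for one of your Nifi hosts, and the exact `logger` flags can vary a little between versions:

# from an external host, send a test message to the syslog listener
logger --udp --server nifi01.example.com --port 1514 "syslog test from $(hostname)"

# on the Nifi host, confirm the packets are actually arriving
tcpdump -i any -n udp port 1514

# and watch the Nifi log for listener activity or errors while testing
tail -f /var/log/nifi/nifi-app.log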
02-13-2018
10:56 PM
Port 514 is a privileged port, which means only a superuser can bind to it. Since there are security implications to running Nifi as root, it's typically run as the nifi user. There are a few options:
- run the syslog listener on a port > 1024, e.g. port 1514 instead of 514
- use iptables to redirect the external-facing port 514 to a non-privileged internal port, and have the syslog listener listen on that port
- use authbind to allow the nifi user to bind to port 514
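Rough sketches of the second and third options, assuming UDP syslog, 1514 as the internal port, and a service user named nifi (adjust to your environment):

# option 2 sketch: redirect inbound UDP 514 to 1514 (add a matching rule for TCP if you use it)
iptables -t nat -A PREROUTING -p udp --dport 514 -j REDIRECT --to-ports 1514

# option 3 sketch: let the nifi user bind to port 514 via authbind;
# note the Nifi process would also need to be launched under authbind (e.g. authbind --deep ...)
touch /etc/authbind/byport/514
chown nifi /etc/authbind/byport/514
chmod 500 /etc/authbind/byport/514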
10-17-2017
03:59 AM
1 Kudo
In Cloudbreak, there are two ways to launch clusters on Azure:
- interactive login: requires admin or co-admin credentials on Azure. I don't have these permissions.
- app based: can deploy a cluster using an existing 'Contributor' role.
Cloudbreak requires the following attributes in order to launch a cluster using the app based method: subscription id, tenant id, app id, and password. Here's what we did to get them:
# login
az login
# create resource group
az group create --name woolford --location westus
# subscription ID
az account show | jq -r '.id'
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx797
# tenant ID
az account show | jq -r '.tenantId'
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx85d
# create an application
az ad app create --display-name woolford --homepage https://woolford.azurehdinsight.net --identifier-uris https://woolford.azurehdinsight.net --password myS3cret!
# get the application ID
az ad app list --query "[?displayName=='woolford']" | jq -r '.[0].appId'
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxa31
We tried to deploy a cluster with Cloudbreak and received the following error:
Failed to verify the credential: Status code 401, {"error":{"code":"InvalidAuthenticationToken","message":"The received access token is not valid: at least one of the claims 'puid' or 'altsecid' or 'oid' should be present. If you are accessing as application please make sure service principal is properly created in the tenant."}}
We then attempted to create the service principal:
az ad sp create-for-rbac --name woolford --password "myS3cret!" --role Owner
(the outcome was the same with --role Contributor)
... and received the following error:
role assignment response headers: {'Cache-Control': 'no-cache', 'Pragma': 'no-cache', 'Content-Type': 'application/json; charset=utf-8', 'Expires': '-1', 'x-ms-failure-cause': 'gateway', 'x-ms-request-id': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxe01', 'x-ms-correlation-request-id': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxe01', 'x-ms-routing-request-id': 'EASTUS:20171017T025354Z:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxe01', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'Date': 'Tue, 17 Oct 2017 02:53:53 GMT', 'Connection': 'close', 'Content-Length': '305'}
The client 'awoolford@hortonworks.com' with object id 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxb67' does not have authorization to perform action 'Microsoft.Authorization/roleAssignments/write' over scope '/subscriptions/7xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx797'.
Can you see what we're doing wrong? Is it possible to create a service principal for an application that I created (if I'm not an admin or co-admin)? If so, how?
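For reference, our understanding is that creating the service principal and granting it a role are normally two separate steps, roughly as sketched below (the <appId> and <subscriptionId> placeholders are illustrative); the role assignment appears to be the step that requires the Microsoft.Authorization/roleAssignments/write permission our account lacks:

# create a service principal for the existing application
az ad sp create --id <appId>

# grant it the Contributor role on the subscription (this is the step that fails for us)
az role assignment create --assignee <appId> --role Contributor --scope /subscriptions/<subscriptionId>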
Labels:
- Hortonworks Cloudbreak
08-25-2017
05:25 AM
1 Kudo
Zeppelin stores a lot of settings in interpreter.json. The default dpi (dots per inch) for R plots is 72, hence the blurry plots. The value can be increased by adding a dpi property to Zeppelin's R render options: search for the "zeppelin.R.render.options" key and add "dpi=300", e.g.
"zeppelin.R.render.options": "out.format = 'html', comment = NA, echo = FALSE, results = 'asis', message = F, warning = F, dpi=300",
You can see an example of the output below:
08-24-2017
11:22 PM
1 Kudo
R's ggplot2 is a popular and versatile data visualization package.
Facet plots are a particularly useful way to spot patterns and anomalies. We can break a scatter plot into facets simply by adding + facet_wrap(~myVar) to an existing plot. Here's an example plot (screenshot from RStudio):
Zeppelin supports the R interpreter. Provided R is installed, we can run R commands inside a notebook cell:
In this case, we ran a Hive query and created a facet plot.
The question: it looks to me like Zeppelin has created a raster image that's been upscaled, and therefore looks fuzzy. In RStudio, the plots look a lot crisper. I notice that there's an open JIRA for this: https://issues.apache.org/jira/browse/ZEPPELIN-1445
Is there a way to make ggplots look crisp in Zeppelin? Is there a way to render plots as PDFs, i.e. a vector format that doesn't get blurry when scaled, and then display those PDFs inside the Zeppelin notebook?
Labels:
- Apache Zeppelin
08-21-2017
04:05 PM
6 Kudos
haveibeenpwned has downloadable files that contain about 320 million password hashes involved in known data breaches. The site has a search feature that lets you check whether a password appears in the list of known breached passwords; from a security perspective, though, entering real passwords into a public website is a very bad idea. Thankfully, the downloadable files make it possible to perform this check offline. Fast random access over a dataset of hundreds of millions of records is a great fit for HBase: queries execute in a few milliseconds. In the example below, we'll load the data into HBase and then use a few lines of Python to convert passwords into SHA-1 hashes and query HBase to see if they exist in the pwned list.

On a cluster node, download the files:

wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-1.0.txt.7z
wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-update-1.txt.7z
wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-update-2.txt.7z

The files are in 7zip format which, on CentOS, can be extracted with 7za:

7za x pwned-passwords-1.0.txt.7z
7za x pwned-passwords-update-1.txt.7z
7za x pwned-passwords-update-2.txt.7z

Unzipped, the raw data looks like this:

[hdfs@hdp01 ~]$ head -n 3 pwned-passwords-1.0.txt
00000016C6C075173C163757BCEA8139D4CC69CF
00000042F053B3F16733DFB83D431126D64331FC
000003449AD45B0DB016B895EC6CEA92EA2F91BE

Note that the hashes are in all caps. Now we create an HDFS location for these files and upload them:

hdfs dfs -mkdir /data/pwned-hashes
hdfs dfs -copyFromLocal pwned-passwords-1.0.txt /data/pwned-hashes
hdfs dfs -copyFromLocal pwned-passwords-update-1.txt /data/pwned-hashes
hdfs dfs -copyFromLocal pwned-passwords-update-2.txt /data/pwned-hashes

We can then create an external Hive table:

CREATE EXTERNAL TABLE pwned_hashes (
  sha1 STRING
)
ROW FORMAT DELIMITED
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/data/pwned-hashes';

Hive has storage handlers that let us query Hive with the familiar SQL syntax while benefiting from the characteristics of the underlying database technology. In this case, we'll create an HBase-backed Hive table:

CREATE TABLE `pwned_hashes_hbase` (
  `sha1` string,
  `hash_exists` boolean)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'=':key,hash_exists:hash_exists',
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.mapred.output.outputtable'='pwned_hashes',
  'hbase.table.name'='pwned_hashes')

Note the second column, 'hash_exists', in the HBase-backed table. It's necessary to do this because HBase is a columnar database and cannot return just a rowkey. Now we can simply insert the data into the HBase table using Hive:

INSERT INTO pwned_hashes_hbase SELECT sha1, true FROM pwned_hashes;

To query this HBase table from Python, there's an easy-to-use HBase library called HappyBase that relies on the Thrift protocol. To use it, it's necessary to start Thrift:

/usr/hdp/2.6.1.0-129/hbase/bin/hbase-daemon.sh start thrift -p 9090 --infoport 9095

We wrote a small Python function that takes a password, converts it to an (upper-case) SHA-1 hash, and then checks the HBase `pwned_hashes` table to see if it exists:

import happybase
import hashlib

def pwned_check(password):
    connection = happybase.Connection(host='hdp01.woolford.io', port=9090)
    table = connection.table('pwned_hashes')
    sha1 = hashlib.sha1(password).hexdigest().upper()
    row = table.row(sha1)
    if row:
        return True
    else:
        return False

For example:

>>> pwned_check('G0bbleG0bble')
True
>>> pwned_check('@5$~ lPaQ5<.`')
False

For folks who prefer Java, we also created a RESTful 'pwned-check' service using Spring Boot: https://github.com/alexwoolford/pwned-check

We were surprised to find some of our own hard-to-guess passwords in this dataset. Thanks to @Timothy Spann for identifying the haveibeenpwned datasource. This was a fun micro-project.
Labels:
08-11-2017
01:42 AM
1 Kudo
I'm curious to know how the Avro data was serialized. I suspect you're experiencing the same issue as me (see https://community.hortonworks.com/questions/114646/sam-application-unknown-protocol-id-12-received-wh.html) and possibly @Brad Penelli (see https://community.hortonworks.com/questions/114758/sam-application-kafka-source-fails.html).
07-24-2017
04:44 PM
1 Kudo
I created a minimal SAM application that reads Avro messages from Kafka and writes them to Druid. The Avro schema for the data in the Kafka topic was previously added to the schema registry. When I run the topology, the following error is thrown:

com.hortonworks.registries.schemaregistry.serde.SerDesException: Unknown protocol id [12] received while deserializing the payload
    at com.hortonworks.registries.schemaregistry.serdes.avro.AvroSnapshotDeserializer.retrieveProtocolId(AvroSnapshotDeserializer.java:75)
    at com.hortonworks.registries.schemaregistry.serdes.avro.AvroSnapshotDeserializer.retrieveProtocolId(AvroSnapshotDeserializer.java:32)
    at com.hortonworks.registries.schemaregistry.serde.AbstractSnapshotDeserializer.deserialize(AbstractSnapshotDeserializer.java:145)
    at com.hortonworks.streamline.streams.runtime.storm.spout.AvroKafkaSpoutTranslator.apply(AvroKafkaSpoutTranslator.java:61)
    at org.apache.storm.kafka.spout.KafkaSpout.emitTupleIfNotEmitted(KafkaSpout.java:335)
    at org.apache.storm.kafka.spout.KafkaSpout.emit(KafkaSpout.java:316)
    at org.apache.storm.kafka.spout.KafkaSpout.nextTuple(KafkaSpout.java:236)
    at org.apache.storm.daemon.executor$fn__5136$fn__5151$fn__5182.invoke(executor.clj:647)
    at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:484)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:745)

I took a peek at the code that's throwing the SerDesException. It seems that the first byte of the Avro input stream is supposed to contain the protocol version/id:

protected byte retrieveProtocolId(InputStream inputStream) throws SerDesException {
    // first byte is protocol version/id.
    // protocol format:
    // 1 byte : protocol version
    byte protocolId;
    try {
        protocolId = (byte) inputStream.read();
    } catch (IOException e) {
        throw new SerDesException(e);
    }
    if (protocolId == -1) {
        throw new SerDesException("End of stream reached while trying to read protocol id");
    }
    checkProtocolHandlerExists(protocolId);
    return protocolId;
}

private void checkProtocolHandlerExists(byte protocolId) {
    if (SerDesProtocolHandlerRegistry.get().getSerDesProtocolHandler(protocolId) == null) {
        throw new SerDesException("Unknown protocol id [" + protocolId + "] received while deserializing the payload");
    }
}
The first byte of the Avro input stream appears to be a form-feed character (ASCII code 12). Looking at the registry metastore, the only ID that exists is a 2:

mysql> SELECT id, type, schemaGroup, name FROM registry.schema_metadata_info;
+----+------+-------------+----------------------+
| id | type | schemaGroup | name |
+----+------+-------------+----------------------+
| 2 | avro | Kafka | temperature_humidity |
+----+------+-------------+----------------------+
1 row in set (0.00 sec)
I don't understand how the first byte of the Avro byte array could contain the ID for the schema registry unless it were created with a schema registry aware serializer. Can you see what I'm doing wrong?
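As a diagnostic, one rough way to see what the producer actually wrote is to hex-dump a single message straight off the topic. This is a sketch: the script path and port assume an HDP install with the default Kafka port, the broker hostname is a stand-in, and the topic name is taken from the registry table above:

# peek at the raw bytes of one message; a leading 0x0c would line up with "Unknown protocol id [12]"
/usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh \
  --bootstrap-server hdp01.woolford.io:6667 \
  --topic temperature_humidity \
  --from-beginning --max-messages 1 | xxd | head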
Labels:
- Schema Registry
06-06-2017
02:15 PM
HBase has a convenient REST service. For example, if we create some records in an HBase table:

$ hbase shell
hbase(main):001:0> create 'profile', 'demographics'
hbase(main):002:0> put 'profile', 1234, 'demographics:age', 42
hbase(main):003:0> put 'profile', 1234, 'demographics:gender', 'F'
hbase(main):004:0> put 'profile', 2345, 'demographics:age', 8
hbase(main):005:0> put 'profile', 2345, 'demographics:gender', 'M'
hbase(main):006:0> scan 'profile'
ROW COLUMN+CELL
1234 column=demographics:age, timestamp=1496754873362, value=42
1234 column=demographics:gender, timestamp=1496754880025, value=F
2345 column=demographics:age, timestamp=1496754886334, value=8
2345 column=demographics:gender, timestamp=1496754891898, value=M

... and start the HBase REST service:

[root@hdp03 ~]# hbase rest start

We can retrieve the values by making calls to the HBase REST service:

$ curl 'http://hdp03.woolford.io:8080/profile/1234' -H "Accept: application/json"
{
"Row": [{
"key": "MTIzNA==",
"Cell": [{
"column": "ZGVtb2dyYXBoaWNzOmFnZQ==",
"timestamp": 1496754873362,
"$": "NDI="
}, {
"column": "ZGVtb2dyYXBoaWNzOmdlbmRlcg==",
"timestamp": 1496754880025,
"$": "Rg=="
}]
}]
}
I notice that the HBase column names and cell values returned by the HBase REST service are base64 encoded:

$ python
>>> import base64
>>> base64.b64decode("MTIzNA==")
'1234'
>>> base64.b64decode("ZGVtb2dyYXBoaWNzOmFnZQ==")
'demographics:age'
>>> base64.b64decode("NDI=")
'42'
>>> base64.b64decode("ZGVtb2dyYXBoaWNzOmdlbmRlcg==")
'demographics:gender'
>>> base64.b64decode("Rg==")
'F'
That's great for machine-to-machine communication, e.g. a webservice, but it isn't very user-friendly since base64 isn't human-readable. Is there a simple way (e.g. a header parameter or HBase property) to make the HBase REST service return human-readable JSON? I realize I could write my own service, but I'd rather re-use existing code/functionality if possible.
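In the meantime, a client-side workaround is easy enough: pipe the response through jq and decode the fields after the fact. A sketch, assuming jq 1.6+ (which provides the @base64d filter) is installed:

curl -s 'http://hdp03.woolford.io:8080/profile/1234' -H "Accept: application/json" | \
  jq '.Row[] | {row: (.key | @base64d), cells: [.Cell[] | {column: (.column | @base64d), value: (."$" | @base64d)}]}'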
Labels:
- Apache HBase
04-05-2017
03:16 AM
show partitions mytable;

Note: if you have more than 500 partitions, you may want to write the output to a file:

$ hive -e 'show partitions mytable;' > partitions

ref: http://stackoverflow.com/questions/15616290/hive-how-to-show-all-partitions-of-a-table
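As an aside, if a table has a lot of partitions and you only care about a subset, SHOW PARTITIONS also accepts a partial partition spec. A sketch, assuming a hypothetical partition column named dt:

$ hive -e "show partitions mytable partition(dt='2017-04-01');"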