Created on 08-21-2017 04:05 PM
haveibeenpwned offers downloadable files that contain about 320 million password hashes from known data breaches.
This site has a search feature that allows you to check whether a password exists in the list of known breached passwords. From a security perspective, entering passwords into a public website is a very bad idea. Thankfully, the downloadable files make it possible to perform this analysis offline.
Fast random access to a dataset with hundreds of millions of records is a great fit for HBase: queries execute in a few milliseconds. In the example below, we'll load the data into HBase, then use a few lines of Python to convert a password into its SHA-1 hash and query HBase to see whether it exists in the pwned list.
On a cluster node, download the files:
wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-1.0.txt.7z
wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-update-1.txt.7z
wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-update-2.txt.7z
The files are in 7-Zip format, which on CentOS can be extracted with 7za:
7za x pwned-passwords-1.0.txt.7z
7za x pwned-passwords-update-1.txt.7z
7za x pwned-passwords-update-2.txt.7z
Unzipped, the raw data looks like this:
[hdfs@hdp01 ~]$ head -n 3 pwned-passwords-1.0.txt
00000016C6C075173C163757BCEA8139D4CC69CF
00000042F053B3F16733DFB83D431126D64331FC
000003449AD45B0DB016B895EC6CEA92EA2F91BE
Note that the hashes are in all caps.
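Because the file stores hashes as upper-case hex, any password we hash ourselves has to be normalised the same way before it is compared. The sketch below is a rough illustration of that, not part of the original workflow: the helper names `sha1_upper` and `hash_in_file` are ours, and it assumes passwords are UTF-8 encoded before hashing. The file check is a plain linear scan, so it is only useful as an offline sanity check on a local copy of the data.

```python
import hashlib


def sha1_upper(password):
    """Return the SHA-1 of a password as 40 upper-case hex characters."""
    # Assumption: the password is encoded as UTF-8 before hashing.
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


def hash_in_file(path, password):
    """Scan a local hash file (one hash per line) for the password's hash."""
    target = sha1_upper(password)
    with open(path, "r") as handle:
        for line in handle:
            # Strip the newline and normalise case before comparing.
            if line.strip().upper() == target:
                return True
    return False
```

For example, `hash_in_file('pwned-passwords-1.0.txt', 'some password')` would walk the whole file in the worst case, which is exactly the cost the HBase lookup avoids.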
Now we create an HDFS location for these files and upload them:
hdfs dfs -mkdir /data/pwned-hashes
hdfs dfs -copyFromLocal pwned-passwords-1.0.txt /data/pwned-hashes
hdfs dfs -copyFromLocal pwned-passwords-update-1.txt /data/pwned-hashes
hdfs dfs -copyFromLocal pwned-passwords-update-2.txt /data/pwned-hashes
We can then create an external Hive table:
CREATE EXTERNAL TABLE pwned_hashes (
  sha1 STRING
)
ROW FORMAT DELIMITED
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/data/pwned-hashes';
Hive has storage handlers that let us query other data stores with familiar SQL syntax while benefiting from the characteristics of the underlying database technology. In this case, we'll create an HBase-backed Hive table:
CREATE TABLE `pwned_hashes_hbase` (
  `sha1` string,
  `hash_exists` boolean)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'=':key,hash_exists:hash_exists',
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.mapred.output.outputtable'='pwned_hashes',
  'hbase.table.name'='pwned_hashes');
Note the second column, 'hash_exists', in the HBase-backed table. It is needed because an HBase row only exists if it holds at least one cell: a rowkey cannot be stored on its own, so the mapping needs at least one column in addition to the key. Now we can simply insert the data into the HBase table using Hive:
INSERT INTO pwned_hashes_hbase SELECT sha1, true FROM pwned_hashes;
To query this HBase table from Python, we can use HappyBase, an easy-to-use library that talks to HBase over the Thrift protocol. This requires the HBase Thrift server to be running:
/usr/hdp/126.96.36.199-129/hbase/bin/hbase-daemon.sh start thrift -p 9090 --infoport 9095
We wrote a small Python function that takes a password, converts it to an (upper case) SHA-1 hash, and then checks the HBase `pwned_hashes` table to see if it exists:
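import hashlib

import happybase


def pwned_check(password):
    # Connect to the HBase Thrift server started above.
    connection = happybase.Connection(host='hdp01.woolford.io', port=9090)
    try:
        table = connection.table('pwned_hashes')
        # hashlib needs bytes; the rowkeys are upper-case hex SHA-1 digests.
        sha1 = hashlib.sha1(password.encode('utf-8')).hexdigest().upper()
        # An empty result means the rowkey is not in the table.
        row = table.row(sha1.encode('ascii'))
        return bool(row)
    finally:
        connection.close()

>>> pwned_check('G0bbleG0bble')
True
>>> pwned_check('@5$~ lPaQ5<.`')
False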
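When several passwords need checking at once, one batched read may be preferable to a round trip per password. The following is a minimal sketch under a few assumptions: `pwned_check_many` and `sha1_upper` are hypothetical helper names of ours, `table` is expected to behave like a HappyBase `Table` (for example, the result of `connection.table('pwned_hashes')`), and `Table.rows()` is expected to yield `(row_key, data)` pairs for the keys it finds. The connection is deliberately left to the caller.

```python
import hashlib


def sha1_upper(password):
    """Return the SHA-1 of a password as upper-case hex (UTF-8 assumed)."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


def pwned_check_many(table, passwords):
    """Map each password to True/False using a single batched read.

    `table` is assumed to expose a happybase-style rows(list_of_keys)
    method returning (row_key, data) pairs.
    """
    # Row keys are sent as bytes, keyed back to the original password.
    key_to_password = {sha1_upper(p).encode("ascii"): p for p in passwords}
    # Keep only keys that came back with at least one cell.
    found = {key for key, data in table.rows(list(key_to_password)) if data}
    return {p: key in found for key, p in key_to_password.items()}
```

The function returns a dictionary keyed by the original passwords, so the caller never has to handle the hashes directly.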
For folks who prefer Java, we also created a RESTful 'pwned-check' service using Spring Boot: https://github.com/alexwoolford/pwned-check
We were surprised to find some of our own hard-to-guess passwords in this dataset.
Thanks to @Timothy Spann for identifying the haveibeenpwned datasource. This was a fun micro-project.