Social Security Number Scanning
From College of Science IT Wiki
In the fall of 2006, Dean Vitter announced that the IT staff in the College of Science would be required to locate and remediate Social Security Numbers on all computer workstations and servers.
Department of Biological Sciences
In the Department of Biological Sciences, we took the following approach.
Unix
Steve Wilson, our Unix admin, wrote a grep-based shell script that he used to scan the Linux machines, including our email server. You may obtain it here:
Unix Scripts 4K, tested with Linux (version?) and OSX 10.4 Server
OS X
Using that as a basis, Steven Hunter next wrote an AppleScript front-end and made a few modifications to produce an OSX-compatible version of the scanner. This package consists of three components:
- The main scanner, which will scan either your Home directory or the entire system (starting at "/" and following ALL folders, including anything mounted in "/Volumes"). The latter requires that the user running the script have sudo permissions.
- A "Drop Scanner" which will scan any files, folders, or disks that are dropped onto it.
- A "Line Scanner" which works like the Drop Scanner but displays potential SSNs in the Terminal window. The purpose is to assist in locating SSNs in large files (such as email mailboxes) where you cannot simply remove the entire file.
Download them all here:
OSX Scanner 920K, Tested with 10.3.9 and 10.4+
Win32
In addition, Eric Hassenplug used a Win32 port of GNU grep with a VBScript front-end to develop a Win32 implementation. It allows you to scan your entire disk or a single file. Also included is an add-in for Outlook that will scan locally cached or Exchange email messages. Download it here:
Win32 Scanner 2MB, Tested with Windows XP and 2000 (SP4)
Notes
All of these scanners should locate SSNs inside any file that uses ASCII or ANSI standard text encoding. We have tested them with MS Word, Excel, PowerPoint, Access, FileMaker Pro, Eudora, and several other well-known file formats. They will also decompress compressed files (when possible). Compressed file support is limited to ZIP on Win32, and elsewhere to the formats that zcat and bzcat support. The scanners will NOT find SSNs inside PDF documents. In the OSX and Win32 versions we tried to exclude as many file types as possible to prevent false positives.
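As an illustration of the raw-text approach described above, here is a minimal Python sketch, not the actual shell, AppleScript, or VBScript code. The pattern (three, two, and four digits separated by hyphens) and the function names are assumptions chosen for the example; the real scripts' grep patterns may differ. It handles only gzip and bzip2 by file extension, standing in for what zcat and bzcat would do.

```python
import bz2
import gzip
import re

# Assumed pattern for this sketch: ddd-dd-dddd, not embedded in a longer digit run.
SSN_PATTERN = re.compile(rb"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def read_raw_bytes(path):
    """Return a file's bytes, decompressing .gz and .bz2 files first."""
    if path.endswith(".gz"):
        opener = gzip.open
    elif path.endswith(".bz2"):
        opener = bz2.open
    else:
        opener = open
    with opener(path, "rb") as handle:
        return handle.read()


def find_candidates(data):
    """Return (byte offset, matched bytes) for each SSN-shaped run in data."""
    return [(m.start(), m.group()) for m in SSN_PATTERN.finditer(data)]
```

A call such as `find_candidates(read_raw_bytes("mailbox.gz"))` would list candidate offsets in a compressed mailbox. The sketch reads each file fully into memory, which would not suit very large files.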
Department of Mathematical Sciences
The scanning tool developed in Math has an emphasis on giving us enough information to track, judge, and improve the effectiveness of our scanning.
The scanner can use both filename-based and content-based rules to classify files by type, and use the type to determine whether to scan the file raw (which works for any file type where SSNs would show up in the raw bytes), or through a 'filter' (such as decompressing a compressed file or extracting the text from a PDF), or not at all (for, e.g., audio files). It is easy to edit rules and integrate other filters. The rules for type classification are in an XML file from the freedesktop.org shared-mime-info project, with over 500 types. They are bundled with the scanner to get consistent results across platforms without depending on each platform's 'file' command, and so any rule additions or edits can be made in one place.
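The classify-then-dispatch idea can be sketched in a few lines of Python. This is only an illustration: the actual scanner is a Perl script driven by the shared-mime-info XML rules, whereas the three small rule tables, the filter names, and the choice to try content rules before filename rules below are all assumptions made for the example.

```python
import fnmatch

# Hypothetical rule tables (the real rules come from shared-mime-info).
MAGIC_RULES = [
    (b"%PDF-", "application/pdf"),
    (b"\x1f\x8b", "application/gzip"),
    (b"ID3", "audio/mpeg"),
]
NAME_RULES = [
    ("*.pdf", "application/pdf"),
    ("*.gz", "application/gzip"),
    ("*.mp3", "audio/mpeg"),
]

# Assumed handling per type: (action, filter name). Unlisted types are scanned raw.
ACTIONS = {
    "application/pdf": ("filter", "pdf-to-text"),
    "application/gzip": ("filter", "gunzip"),
    "audio/mpeg": ("skip", None),
}


def classify(name, head):
    """Pick a type from the file's leading bytes, falling back to its name."""
    for magic, mime in MAGIC_RULES:
        if head.startswith(magic):
            return mime
    for pattern, mime in NAME_RULES:
        if fnmatch.fnmatchcase(name.lower(), pattern):
            return mime
    return "application/octet-stream"


def plan(name, head):
    """Return (type, action, filter name) for one file."""
    mime = classify(name, head)
    action, filter_name = ACTIONS.get(mime, ("raw", None))
    return mime, action, filter_name
```

For example, `plan("mystery", b"\x1f\x8b\x08")` classifies an extensionless file as gzip from its leading bytes and routes it through the hypothetical `gunzip` filter.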
The scanner's main output is a log (human-readable in a pinch, but XML for easy processing) that includes a record for every file seen, how it was classified, what filter was used if it was scanned, the rule responsible if it was not, and the positions of hits in scanned files. The log can be the archival record of a scan.
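To make the shape of such a log concrete, here is a small Python sketch that builds one record per file and then answers a simple question from the log. The element and attribute names (`scanlog`, `file`, `hit`, `offset`, and so on) are invented for this example and are not the scanner's actual schema.

```python
import xml.etree.ElementTree as ET


def file_record(path, mime, action, filter_name=None, rule=None, hits=()):
    """Build one <file> record; hits is a sequence of (offset, length) pairs."""
    rec = ET.Element("file", {"path": path, "type": mime, "action": action})
    if filter_name:
        rec.set("filter", filter_name)  # which filter was used, if any
    if rule:
        rec.set("rule", rule)  # which rule caused a skip, if any
    for offset, length in hits:
        ET.SubElement(rec, "hit", {"offset": str(offset), "length": str(length)})
    return rec


def files_with_hits(log_root):
    """List the paths of all logged files that contain at least one hit."""
    return [f.get("path") for f in log_root.iter("file") if f.find("hit") is not None]


# Hypothetical two-file scan
log = ET.Element("scanlog")
log.append(file_record("/home/a/mbox", "application/mbox", "raw", hits=[(120, 11)]))
log.append(file_record("/home/a/song.mp3", "audio/mpeg", "skip", rule="audio"))
print(ET.tostring(log, encoding="unicode"))
```

Because every file gets a record whether or not it was scanned, questions such as "which files were skipped, and by which rule?" can be answered from the log alone.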
Several stock reports run from the log, summarizing files with hits, files scanned, filtered, or skipped by type, by rule, and by filter. Other ways of looking at the data can be coded up at will in XSLT or XQuery--both languages are supported by the Saxon engine, bundled with the tool. XQuery can also be used interactively to explore big results, and several functions specific to SSN scanning are usable in queries.
Some simple GUI elements can let a user drill down into results for specific directories or files, and view hits in the surrounding context retrieved from the files. It is also possible to build workflows in which a user selects files or hits and clicks to update a database of reviewed/remediated files. XQuery functions are provided for presenting a graphical (Swing) tree widget for exploring scan results or the results of any query; the user can select any combination of nodes displayed, and the selected nodes are returned as the function results. Callback functions written in XQuery can be supplied to control what is displayed for any node or for a mouse-hover tooltip over any node.
Unix
The scanner is a Perl script requiring Perl 5.8.1 or later. The few Perl modules used that are not included in Perl 5.8.1 are bundled (with a few other necessary files) in a lib/ directory to avoid depending on what modules are installed on a given machine. Several filters do require other software present on the machine, as noted in the README.
Scanning speed (measured on a 2.2 GHz Opteron MP) hits about 3.5 MB/sec on collections of variously-sized files that require no filtering. Speeds of different filters vary greatly (PDF text extraction is pretty slow) but our typical mixes of files go about 1.2 to 1.3 MB/sec. These timings include all overheads of finding, classifying, opening and, if needed, filtering the files.
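For planning purposes, these rates translate into scan times with simple arithmetic. The sketch below is only an estimate under stated assumptions: the 100 GB filesystem size is hypothetical, 1 GB is taken as 1024 MB, and real throughput will depend on hardware and file mix.

```python
def estimated_hours(total_gb, mb_per_sec):
    """Estimate scan duration in hours for total_gb of data at a given rate."""
    total_mb = total_gb * 1024  # assumes 1 GB = 1024 MB
    return total_mb / mb_per_sec / 3600


# Hypothetical 100 GB filesystem at the unfiltered and typical mixed rates
for rate in (3.5, 1.25):
    print(f"{rate} MB/sec: about {estimated_hours(100, rate):.1f} hours")
```

At the typical mixed rate, a filesystem of that size would take most of a day, which is one reason to consider scanning in passes.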
Reporting and querying use the Saxon engine, which requires Java. If it's more convenient, reports and queries can be done on a machine with Java using logs of scans on other machines.
OS X
No Apple-specific front-end has (yet) been developed, but the script runs on OS X just as it does on Unix. It behaves as a standard command-line Unix tool to simplify any front-end.
Windows
ActiveState Perl is available for Windows. The scanner script tries to use portable Perl constructs and should not be hard to get running on Windows, but that has not yet been done. Various filters rely on other software (such as Ghostscript for PDF text extraction) that will need to be present too.
Notes
- The patterns that will be considered 'hits' are detailed in the man page. They are chosen to be sensitive to the ways we see SSNs in real files, while holding false positives down. The scanner knows the rules for valid SSNs from the Social Security Administration.
- Some applications produce PDFs that still can't be text-extracted with the techniques the filter uses.
- Our Computer Committee requested a way to avoid including the actual hit strings in the log file, so the scanner has three choices for handling these: preserve, omit, or hash.
  - Preserve keeps the original strings in the log, which should then be treated as sensitive data.
  - Omit replaces each digit with X, preserving punctuation. Reporting breakdowns by type of punctuation are possible, but information on which hits are distinct or duplicate is lost.
  - Hash replaces each SSN with a one-way hash, preserving punctuation. If you might be interested in how many hits in a file are distinct or repeated, this form preserves that information.
- Filters can be developed (in XSLT or XQuery) to postprocess a scan log and filter out common false positives. The filters can refer to file names, file types, and context surrounding a hit, using any of the regular-expression or other available XQuery functions. Filters can be added and updated and run again on existing logs without repeating an entire scan. A sample filter for common hits in Apple Mail files is included. These filters are where our current effort is going.
- At a recent Academic IT mini-retreat, ECN presented some work on SSN scanning that included a large collection of patterns they've developed for post-filtering false positives. They don't seem ready yet to share those patterns, but that's obviously a great investment of effort that nobody would want to duplicate, and would complement our tool.
- The scanner's memory demand is independent of number of files scanned, so it can scan any size of filesystem and produce a log of any length. Reports and queries, however, use the Saxon engine, which loads the log into memory and so is limited in the size of log it can process. So, it may be helpful to scan a large filesystem in passes on different subtrees, to limit the size of any one log. Or, a SAX-based streaming filter could be used to split the largest logs into well-formed smaller ones. Or, running reports and queries on a 64-bit machine after downloading Sun's 64-bit Java runtime pretty much moots the issue. (OS X Leopard on 64-bit hardware is expected to offer the same benefit.)
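Two of the notes above, the validity rules and the three ways of handling hit strings, can be illustrated with a short Python sketch. It is not the scanner's Perl code. The validity check covers only the widely published structural rules (area 000, 666, and 900 to 999 are never assigned; group 00 and serial 0000 are invalid), and the man page remains the authority on what the scanner actually accepts. The output format for hashed hits, the use of SHA-256, and the `salt` parameter are assumptions made for the example.

```python
import hashlib
import re


def is_plausible_ssn(area, group, serial):
    """Apply basic structural rules to the three digit groups (as strings)."""
    if area in ("000", "666") or int(area) >= 900:
        return False
    return group != "00" and serial != "0000"


def render_hit(hit, mode, salt=b""):
    """Return the form of a hit string to store in the log."""
    if mode == "preserve":
        return hit  # the log must then be treated as sensitive data
    template = re.sub(r"[0-9]", "X", hit)  # digits masked, punctuation kept
    if mode == "omit":
        return template
    if mode == "hash":
        digits = re.sub(r"[^0-9]", "", hit)
        digest = hashlib.sha256(salt + digits.encode("ascii")).hexdigest()
        # Assumed format: masked template plus a truncated digest
        return template + "#" + digest[:16]
    raise ValueError("unknown mode: " + mode)
```

With `mode="hash"`, repeated occurrences of the same number produce the same digest, so distinct and duplicate hits can still be counted. Since there are at most a billion nine-digit values, an unsalted hash can be reversed by exhaustive search, so a secret `salt` would matter in practice.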
Obtaining
Development is currently taking place in Math's subversion repository, but the College of Science repository contains a (sometimes recent) snapshot. This file is a POSIX PAX archive. (If your OS lacks the pax command, tar also works to unpack it.)
Updates to the snapshot may introduce new interesting features described in the change history. It may be worthwhile to check with Math to learn if any good stuff isn't in the current snapshot.
Note: as of rev. 37 of the snapshot, the XQuery tool no longer retains locally-declared functions across &eval, &run, or &pull. If you have developed anything that relied on that feature, rev. 36 is the most recent snapshot you will want to use until you can accommodate the change.

