PDA

View Full Version : SQL Gurus: Help to "anonymize" some data


rudeney
10-07-09, 02:38 PM
I am doing some consulting work for a company that wants to showcase its new software at a trade show. They have a customer who has graciously let them use their live data, but they insist on having all the names changed for privacy concerns (of course).

Does anyone have any great ideas on this? It's a MS-SQL 2008 database with "millions" of rows, so I'm definitely looking at an automated script to do this as opposed to manual entry. I did this long, long ago with some old ISAM data where we just "skewed" first and last names. I'd rather do something more sophisticated since this is consumer financial data and we don't want someone saying, "Oh, look, Mr. Brott's DirecTV bill is in collections!" Any good ideas?

Marlin Guy
10-07-09, 02:43 PM
"Oh, look, Mr. Brott's DirecTV bill is in collections!"

Isn't that common knowledge anyway? :lol:

Fontano
10-07-09, 02:58 PM
My recommendation:

Step 1; Identify the unique number of First and Last Names (remove Middle Names)
Step 2: Identify a proportionate number of random names, for each of those.

Example: If you have 500 unique first names, identify 10% unique new first names, using say maybe the baby name generator. Do the same for Last Names.

Step 3: Replace all the names randomly picking a new first name and Last Name.
Making sure that the names don't match the originals, or share even the first letters.

You can add to those random lists, by adding common pre-fix and suffix endings, like Mc, O', -ski and so on.

That way you will get names that are still logical, and will make sense in displays.

For middle names, just do a random middle initial.

rudeney
10-07-09, 03:16 PM
My recommendation:

Step 1; Identify the unique number of First and Last Names (remove Middle Names)
Step 2: Identify a proportionate number of random names, for each of those.

Example: If you have 500 unique first names, identify 10% unique new first names, using say maybe the baby name generator. Do the same for Last Names.

Step 3: Replace all the names randomly picking a new first name and Last Name.
Making sure that the names don't match the originals, or share even the first letters.

You can add to those random lists, by adding common pre-fix and suffix endings, like Mc, O', -ski and so on.

That way you will get names that are still logical, and will make sense in displays.

For middle names, just do a random middle initial.

Interesting idea. I have about 20,000 unique first and last names. My biggest concern is that some will be so "unique" as to be identifiable. We may just have to take some time to filter out the super-rare names.

Fontano
10-07-09, 03:27 PM
Interesting idea. I have about 20,000 unique first and last names. My biggest concern is that some will be so "unique" as to be identifiable. We may just have to take some time to filter out the super-rare names.

Can you split the lists:

- Unique First Name
- Unique Last Name

You could use the First Names after eliminating the very unique ones (look for ones with a length of more then say 7 characters and less then 3)

The with the last names, see if you can find a list on google or something like that of the most common last names out there.

rudeney
10-07-09, 03:43 PM
Can you split the lists:

- Unique First Name
- Unique Last Name

You could use the First Names after eliminating the very unique ones (look for ones with a length of more then say 7 characters and less then 3)


Good! I will do that.

The with the last names, see if you can find a list on google or something like that of the most common last names out there.

Yeah, I found some lists with common names. Really, I could make them all "Smith", or even munge them with random characters so they become unpronounceable, but I wanted a more creative solution so the data looks real.

wilbur_the_goose
10-08-09, 10:19 AM
Do what Fontano said, but take it another step

Don't use randomness - it's tough to test if you have multiple test databases.

Take a number that's unique to each row (the primary key?) and divide that number by the number of rows in your name table. Use the remainder of the division (the MOD function) to serve as a PK to read a row from the name table. Then update the master table with the name just read.

Repeat for first name.

If you have a column that represents gender, you can have a male first name table and a female first name table.

Also be careful if you have non-natural entities in your table - you don't want "Acme Corp" to be depersonalized to "Mary Brown"!

That will give you the same depersonalized name for a given row in every test database generated from your live data. And it's not a name that can be reverse engineered.

PS - RODNEY - I'd be careful working for this client. They may actually be breaking federal laws (GLBA type laws) by letting this information outside their company.

steve053
10-08-09, 11:05 AM
Another simple idea is to change the font on the name column to wingdings or some other symbol based font.

Or

Concatonate second letter of First Name (or Company Name) & the first letter of the same name & the Length of the full name.

Mary Jones = Am10
Mark Thompson = Am13
DBSTalk = Bd7

rudeney
10-08-09, 12:51 PM
Do what Fontano said, but take it another step

Don't use randomness - it's tough to test if you have multiple test databases.

Take a number that's unique to each row (the primary key?) and divide that number by the number of rows in your name table. Use the remainder of the division (the MOD function) to serve as a PK to read a row from the name table. Then update the master table with the name just read.

Repeat for first name.

If you have a column that represents gender, you can have a male first name table and a female first name table.

Also be careful if you have non-natural entities in your table - you don't want "Acme Corp" to be depersonalized to "Mary Brown"!

That will give you the same depersonalized name for a given row in every test database generated from your live data. And it's not a name that can be reverse engineered.


Thanks - more good ideas!


PS - RODNEY - I'd be careful working for this client. They may actually be breaking federal laws (GLBA type laws) by letting this information outside their company.

I'm not very worried. My client is a small software company that has been in business for a very long time. I was an employee many years ago (almost 20) and they have hired me back as a consultant to help them as they are struggling. They have a legacy COBOL software product and are trying to get a new modern (.NET/SQL) product off the ground. Their customers are mostly privately-owned companies that, as best as I can say without going into great detail "trade in data". One of those customers gave us their data to convert from legacy to the new product with permission to use it for demonstration purposes as long as we change it so it cannot be recognized. We won't be giving this data to anyone else, just displaying it in demos and such. Once the names are anonymized (and I am doing the same with other non-name data), it will be just random "pretend" information.

wilbur_the_goose
10-08-09, 06:58 PM
Rodney,
If that data is from a financial services firm, that firm is quite possibly breaking the law by giving your firm the data. Your firm is fine, but that other company should not be giving you PII.

(I work in the financial svcs world, and if anybody did this (give a software company PII data) at my company, they'd be in very deep kaka if not let go with malice)