The Leslie Problem

In my last post I mentioned collecting random names by nationality as test data for a project. Doing my normal overthinking, I started wondering how authors could choose appropriate random names for characters, and ran into a discussion about “The Leslie Problem”.

Suppose you are a gender equality researcher and your data has actual names, but no gender assigned to them. You can generally assume that “Mark” is probably male (I’m ignoring all the LGBTQ+ issues) and “Susan” is probably female. For gender-neutral names, you can estimate male/female percentages from birth data that does record gender. The Leslie Problem refers to the fact that the male/female ratios of names change over time. “Leslie” used to have a higher male percentage, and now it is much more weighted towards females. (The same is true of the name “Shirley”.)

This makes the research harder. Now you can’t just look at current data: you need to (a) look at historical data in that geography and (b) know the ages of the people in your study data. Of course, if you only have names and not genders for your study data, you probably don’t have ages either.
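To make the age dependence concrete, here is a minimal sketch in Python of how such an estimate might be computed. The counts are toy numbers I invented purely for illustration (they are not real birth statistics), and the names `TOY_BIRTHS`, `build_counts` and `female_share` are hypothetical. Real data would come from a source that records name, sex and birth year.

```python
from collections import defaultdict

# Toy counts, invented for illustration only (NOT real birth statistics).
# Each row: (birth_year, name, sex, number_of_births)
TOY_BIRTHS = [
    (1920, "Leslie", "M", 900),
    (1920, "Leslie", "F", 100),
    (1990, "Leslie", "M", 50),
    (1990, "Leslie", "F", 950),
]


def build_counts(rows):
    """Index birth counts by (lowercased name, birth year)."""
    counts = defaultdict(lambda: {"M": 0, "F": 0})
    for year, name, sex, n in rows:
        counts[(name.lower(), year)][sex] += n
    return counts


def female_share(counts, name, year=None):
    """Estimate the share of births with this name recorded as female.

    If year is None, all years are pooled, which is what you are forced
    to do when you do not know a person's age. Returns None when there
    is no data for the name (or for that name and year).
    """
    name = name.lower()
    if year is None:
        keys = [k for k in counts if k[0] == name]
    else:
        keys = [(name, year)] if (name, year) in counts else []
    females = sum(counts[k]["F"] for k in keys)
    males = sum(counts[k]["M"] for k in keys)
    total = females + males
    return females / total if total else None


counts = build_counts(TOY_BIRTHS)
print(female_share(counts, "Leslie", 1920))  # share for people born in 1920
print(female_share(counts, "Leslie", 1990))  # share for people born in 1990
print(female_share(counts, "Leslie"))        # pooled share, age unknown
```

With these made-up numbers the female share moves from 0.1 for the 1920 cohort to 0.95 for the 1990 cohort, while the pooled figure sits near the middle and tells you very little about any individual Leslie. That is the problem in miniature.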

It is difficult to get historical birth data with both name and gender for most places. In the US, the Social Security Administration does publish the necessary data (including by state). The French government is also pretty good and publishes it by department (geographic sub-regions). The University of Minnesota collects data from other countries, but you need a validated research need for access, and my retired-person curiosity doesn’t count. Other places are more hit and miss. The South Australian state government does a good job with downloadable data. The New South Wales state government provides the data back to 1950 in PDF form, which is a pain to disentangle. There is an interesting Wikipedia page, but more than half the links to the original data are broken. There are various baby name websites, but they generally only have rankings of the top 50 baby names in XYZ country, and rankings of just the top names don’t help real statistical analysis.

In the course of trying to find sources of good data, I stumbled across a few different internet sites that purport to generate random identities. Besides generating random first and last names based on some unknown data for a country, they also generate age, addresses, social security numbers, phone numbers, Mastercard numbers, blood type, employer, occupation, what the person will die of and at what age, and other interesting tidbits. They generally claim that the street and city exist in the relevant country but the actual street number does not, and that the phone numbers and Mastercard numbers are chosen from known invalid numbers. Of course, none of this is to be used for illegal purposes.

After generating a few random identities at each of those sites, one thing I noticed was the lack of any effort to keep the data consistent. The person might be a part-time oncologist in the health industry, but their employer is “Karl’s Home Marketing”. One was Hispanic but with very anglophone first and last names for both themselves and their parents (hey, maybe they were adopted). This is understandable. It takes a lot of work to ensure consistency of random data when you don’t get paid to do it.

Anyway, an interesting rabbit hole.

As usual, feel free to disagree using this contact link. My world view is a hypothesis, not a belief.
