
Thursday, December 13, 2018

Common passwords are bad ... by definition

It's that time of the year again, time for the annual lists of worst passwords.  Top of at least one list: 123456, followed by password.  It just goes to show how people never change.  Silly people!

Except ...

A good password has a very high chance of being unique, because a good password is selected randomly from a very large space of possible passwords.  If you pick your password at random from a trillion possibilities*, then the odds that a particular person using the same scheme picked your password are one in a trillion.  The odds that one of a million other such people picked your password are about one in a million, as are the odds that any particular two people picked the same password.  If a million people used the same scheme as you did, there's a good chance that some pair of them accidentally share a password, but almost certainly almost all of those passwords are unique.

If you count up the most popular passwords in this idealized scenario of everyone picking a random password out of a trillion possibilities, you'll get a fairly tedious list:
  • 1: some string of random gibberish, shared by two people
  • 2 - 999,999: Other strings of random gibberish, 999,998 in all
Now suppose that seven people didn't get the memo.  Four of them choose 123456 and three of them choose password.  The list now looks like
  • 1: 123456,  shared by four people
  • 2: password,  shared by three people
  • 3: some string of random gibberish, shared by two people
  • 4-999,994:  Other strings of random gibberish, 999,991 in all
Those seven people are pretty likely to have their passwords hacked, but overall password hygiene is still quite good -- 99.9993% of people picked a good password.  It's certainly better than if 499,999 people picked 123456 and 499,998 picked password, two happened to pick the same strong password and the other person picked a different strong password, even though the resulting rankings are the same as above.
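The arithmetic behind this can be sketched with the standard birthday-problem approximation. The numbers below are purely illustrative, matching the trillion-password, million-user scenario above:

```python
import math

N = 10**12   # size of the password space (one trillion)
n = 10**6    # number of people choosing uniformly at random

# Probability that at least one of the other n-1 people
# happened to pick *your* password.
p_yours = 1 - (1 - 1/N) ** (n - 1)

# Birthday-style approximation: chance that at least one
# pair among the n people shares a password.
p_any_pair = 1 - math.exp(-n * (n - 1) / (2 * N))

# Expected number of colliding pairs.
expected_pairs = n * (n - 1) / (2 * N)

print(f"chance someone matched you: {p_yours:.7f}")    # about one in a million
print(f"chance of any shared pair:  {p_any_pair:.2f}") # roughly 0.39
print(f"expected colliding pairs:   {expected_pairs:.2f}")
```

So with a million well-behaved users, there's a bit better than a one-in-three chance of a single accidental collision, and essentially every password is unique.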

Likewise, if you see a list of 20 worst passwords taken from 5 million leaked passwords, that could mean anything from a few hundred people having picked bad passwords to everyone having done so.  It would be more interesting to report how many people picked popular passwords as opposed to unique ones, but that doesn't seem to make its way into the "wow, everyone's still picking bad passwords" stories.

From what I was able to dig up, that portion is probably around 10%.  Not great, but not horrible, and probably less than it was ten years ago.  But as long as some people are picking bad passwords, the lists will stay around and the headlines will be the same, regardless of whether most people are doing a better job.

(I would have provided a link for that 10%, but the site I found it on had a bunch of broken links and didn't seem to have a nice tabular summary of bad passwords vs. other passwords from year to year, so I didn't bother.)

*A password space of a trillion possibilities is actually pretty small.  Cracking passwords is roughly the same problem as the hash-based proof-of-work that cryptocurrencies use.  Bitcoin is currently doing around 100 million trillion hashes per second, or a trillion trillion hashes every two or three hours.  The Bitcoin network isn't trying to break your password, but it'll do for estimating purposes.  If you have around 100 bits of entropy, for example if you choose a random sequence of fifteen capital and lowercase letters, digits and 30 special characters, it would take a password-cracking network comparable to the Bitcoin network around 400 years to guess your password.  That's probably good enough.  By that time, password cracking will probably have advanced far beyond where we are and, who knows, maybe we'll have stopped using passwords by then.
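As a rough sanity check of those footnote numbers (the 92-character alphabet and the 10^20 hashes-per-second rate are the ballpark figures from the footnote, not measurements):

```python
import math

# Character set from the footnote: upper + lower case letters,
# digits, and 30 special characters.
alphabet = 26 + 26 + 10 + 30          # 92 symbols
length = 15
bits = length * math.log2(alphabet)   # entropy of a random password

space = alphabet ** length            # number of possible passwords
rate = 1e20                           # ~Bitcoin network, hashes/second
seconds_per_year = 3600 * 24 * 365.25

years_full_search = space / rate / seconds_per_year
years_100_bits = 2**100 / rate / seconds_per_year

print(f"entropy: {bits:.1f} bits")                          # ~97.9
print(f"full search of 92^15 space: {years_full_search:.0f} years")
print(f"full search of 2^100 space: {years_100_bits:.0f} years")  # ~400
```

The 15-character example comes out just under 98 bits (a few decades short of the 400-year figure for a full 2^100 search), which is why the footnote hedges with "around 100 bits."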

Wednesday, November 4, 2009

60 Minutes and the MPAA: Part IV - Error bars

In the 60 Minutes piece I've been referencing, A-list director Steven Soderbergh drops the oft-quoted figure of $6.1 billion per year in industry losses. This figure comes from a 2006 study by consulting firm L.E.K. It's easy to find a summary of this report. Just google "video piracy costs" and up it comes. Depending on your browser settings, you may not even see the rest of the hits, but most of the top ones are repeats or otherwise derived from the L.E.K. study. And you didn't need to see anything else anyway, did you?

So ... $6.1 billion. Let's assume for the moment that the figure is relevant -- more on that in the next post. How accurate is it?

One of the handful of concepts I retained from high school physics, beyond Newton's laws, was that of significant digits, or "sig digs" as the teacher liked to call them. By convention, if I say "6.1 billion", I mean that I'm confident that it's more than 6.05 billion and less than 6.15 billion. If I'm not sure, I could say 6 billion (meaning more than 5.5 billion and less than 6.5 billion).

Significant digits are just a rough-and-ready convention. If you're serious about measurement you state the uncertainty explicitly, as "6.1 billion, +/- 300 million". My personal opinion is that even if you're not being that rigorous, it's a bad habit to claim more digits than you really know, and a good habit to question anything presented like it's known to an unlikely number of digits.

The point of all this is that precise results are rare in the real world. Much more often, the result is a range of values that we're more or less sure the real value lies in. For extra bonus points, you can say how sure, as "6.1 billion, plus or minus 300 million, with 95% confidence".
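For illustration, here's what the two conventions imply numerically (the +/- 300 million figure is the hypothetical one used above, not anything from the L.E.K. study):

```python
value = 6.1e9

# Significant-digits convention: "6.1 billion" implies the true
# value lies between 6.05 and 6.15 billion, i.e. an implicit
# half-unit uncertainty in the last stated digit.
implicit_half_width = 0.05e9

# Explicit form: "6.1 billion, +/- 300 million".
explicit = 0.3e9

print(f"implicit relative uncertainty: {implicit_half_width / value:.1%}")
print(f"explicit relative uncertainty: {explicit / value:.1%}")
```

The implicit uncertainty is under 1%, while the explicit one is about 5%; that gap is exactly the "claiming more digits than you really know" problem.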

From what I can make out, L.E.K. is a reputable outfit and made a legitimate effort to produce meaningful results and explain them. In particular, they didn't just try to count up the number of illegal DVDs sold. If I buy an illegal DVD but go and see the movie anyway, or I never would have seen the movie at all if not for the DVD, it's hard to claim much harm. So L.E.K. tried to establish "how many of their pirated movies [viewers] would have purchased in stores or seen in theaters if they didn't have an unauthorized copy". They did this by surveying 17,000 consumers in 22 countries, doing focus groups and applying a regression model to estimate figures for countries they didn't survey. (This is from a Wall Street Journal article on L.E.K. web site and from the "methodology" section of the summary mentioned above).

On average, they surveyed about 800 people per country, presumably more in larger countries and fewer in smaller. That's enough to do decent polling, but even an ideal poll typically has a statistical error of a few percent. This theoretical limit is closely approached in political polls in countries with frequent elections, because polling is done over and over and the pollsters have detailed knowledge of the demographics and how they might affect results. They apply this knowledge to weight the raw results of their polling in order to compensate for their sample not being completely representative (for example, it's weighted towards people who will answer the phone when pollsters call and are willing to answer intrusive questions).

For international market research in a little-covered subject, none of this is available. So even if you have a reasonably large sample, you still have to estimate how well that sample represents the public at large. There are known techniques for this sort of thing, so it's not a total shot in the dark, but I don't see any way you can assume anything near the familiar "+/- 3%" margin. At a wild guess, it's more like 10-20%, by which I mean you're measuring how the population at large would answer the question, not what they would actually do, with an error of -- who knows, but let's say -- 10-20%. More than the error you'd assume by just running the sample size and the population size through the textbook formula, anyway.
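For reference, the "textbook formula" is the usual normal-approximation margin of error. This sketch assumes a simple random sample, which is precisely the assumption being questioned here:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Textbook 95% margin of error for a simple random sample
    of size n, at the worst-case proportion p = 0.5."""
    return z * math.sqrt(p * (1 - p) / n)

# ~800 respondents per country, as in the L.E.K. survey
moe = margin_of_error(800)
print(f"+/- {moe:.1%}")  # about +/- 3.5%
```

That +/- 3.5% is a floor, not a ceiling: it accounts only for sampling noise, not for an unrepresentative sample, dishonest answers, or hypothetical-behavior questions.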

All of this is assuming that people won't lie to surveyors about illicit activity, and that they are able to accurately report what they might have done in some hypothetical situation. Add to that uncertainties in the model for estimating countries not surveyed and the nice, authoritative statement that "Piracy costs the studios $6.1 billion a year" comes out as "Based on surveys and other estimates done in 2006, we think that people who bought illegal DVDs might have spent -- I'm totally making this up here -- somewhere between $4 billion and $8 billion on legitimate fare that year instead, but who really knows?"

Now $4 billion, or whatever it might really be, is still serious cash. The L.E.K. study at the least makes a good case that people are spending significant amounts on pirated goods they might otherwise have bought from studios. I'm not disputing that at the moment. Rather, I'm objecting to a spurious air of precision and authority where very little such exists. More than that, I'm objecting to an investigative news program taking any such key figure at face value without examining the assumptions behind it or noting, for that matter, that it was commissioned by the same association claiming harm.

And again, this is still leaving aside the crucial question of relevance.

Saturday, October 17, 2009

The death of email: Huh?

The office building where I work has hot and cold running news on a screen in the hallway (except when the screen shuts off from overheating in the closed wooden frame that holds it). A couple of days ago it showed a story on the reported death of email. Cause of death: online "member communities". Hmm ... communities with members in them ... must be one of those web 2.0 things.

That seemed like good Field Notes material, so I went searching. Turns out email has died a couple of times already, for example in 2007, and in 2006, and in 2004 ... the trail gets a bit harder to follow past that due to link rot, but a bit more googling indicates that the idea has been around since the turn of the millennium. Naturally enough, the end of email has been predicted for even longer -- here's a page from 1998 debunking the idea that Microsoft (of course) was going to bring about its demise. [Googling "death of email" is still good for a number of hits, including at least one talking about an email "renaissance" -- hmm ... I'd be worried about that last one --D.H. Dec 2015].

Clearly a bit of skepticism is in order.

What's the evidence this time? It seems to be a recent Nielsen study reporting that more people belong to "member communities" than have email. OK ... does the study mention anything at all about trends in email usage? Of course not. Why should it? It's a study of "member communities" that happens to make a particular comparison of user bases in passing. Does Nielsen claim email is dying? Of course not.

Further, the "member communities" designation looks fairly broad. It doesn't just include the usual social networking suspects like Facebook, MySpace, Orkut, LinkedIn and so on. It also includes blogging. Yep, if you're reading this blog, you're killing email. Like I said, take it with a grain of salt.

The next obvious question is, what actually is happening with email? That question doesn't seem to yield so quickly to googling, and Wolfram Alpha unfortunately lists it as a "future topic". One difficulty is that the measurement tends to be done by people trying to do email marketing, who probably have a predisposition to aim high. I'm not so much interested in whether there's more or less spam than before. If spam went away tomorrow leaving only actual email and some small residue of legitimate marketing, that would indeed be a renaissance, not a death.

In any case, I'm pretty sure email isn't dead, and I'm even more sure that the Nielsen report has no bearing on the question.

One interesting bit did catch my attention, though. Google's Orkut seems to be doing quite well in its target market of Brazil [Orkut has since been shut down --D.H. Dec 2015].

Sunday, August 23, 2009

Some highlights of the year

Here are some tidbits gleaned from Google Analytics:
  • The top five pages are
    1. The main Field Notes page. I think that's generally people dropping back in to check the site.
    2. Go ahead and talk to strangers, about the intriguing and popular Omegle anonymous chat server.
    3. Information age: Not dead yet, a reply to Joe Andrieu's contention that the Information Age is behind us now, has been picking up steady traffic from searches like "When did the information age begin?" ever since. It doesn't answer the question. Rather, it argues that the question itself is ill-defined, which I still think is a valid response.
    4. "Hackers crack SSL", one of a series of articles in which I tried to track down what actually happened and discovered an interesting tale. I suspect that at least some visitors were looking for advice on how to crack SSL, in which case they would likely have come away disappointed.
    5. Now what happened to my bookmarks?, another post likely to disappoint searchers. It's about why I don't seem to use the bookmarks feature on my browser much at all, not about how to recover lost bookmarks, so I've added a note at the top with links to several more useful pages and/or searches.
  • The top five searches are
    1. omega talk to strangers (see above)
    2. how to guess someones password on worldscape. It turns out that the words in question appear in close enough proximity for Google to think I might have something to say on the topic. I don't.
    3. field notes on the web, fairly enough
    4. what happened to my google bookmarks (see above)
    5. when did the information age begin (likewise)
  • Other searches that caught my eye, for whatever reason
    • how many threes are in a dozen?
    • powerset -"power set" -powersets
    • hammock kenotic torrent
    • "poach tickets"
    • "david hull" -aerosmith +microsoft
    • "david hull" -aerosmith -humanities
    • "what's a concept"
    • accelerometer, disable the device,driving
    • al gore lisp
    • all human knowledge how much information
    • hand ciphers touch-tone phone
    • how can the concept of a wovel be used in ecommerce systems
    • j. k. rowling lisp
    • photo fedex man eating lunch
    • quick before you change your mind david hull
    • salt lake city trip checked into hotel
    • was kathleen antonelli interested in music
    • what is that chat room called? omega something? chat with strangers?
    • why did the information age occur
["How many threes are in a dozen?" has a particularly interesting history with this blog ... I'll let you do the searching, particularly since results may change from time to time ... --D.H. Dec 2015]

Friday, May 23, 2008

Crowdsourcing crime statistics

A while ago I ran across a dispute concerning Sitefinder. Sitefinder provides access to a database of cell tower locations. Unfortunately, the database is incomplete, as not all providers have agreed to provide data for it. In the original post, I suggested this was a job for crowdsourcing, though I don't know whether anything of the sort actually happened.

However, someone has put into practice the general concept of crowdsourcing a parallel database when the official information is not readily available. Vasco Furtado's site, wikicrimes.org, uses pushpins on a Google map to chart crime in Brazil. Anyone can add a pushpin, or confirm (or disconfirm) a crime already reported.

Judging from the site, either they haven't hit critical mass yet or Brazil's crime rate is exceptionally low. Even if there were more data points, one would have to take it with a grain of salt, if only because some areas may happen to have more wikicrimes contributors and different areas have different rates of internet penetration. Of course, one should take any statistics, official or otherwise, with a grain of salt. At the very least, it's an interesting experiment.