Curly and I have been playing a game called “How much information can you get with no information?” A fun field to play this on is Wikipedia’s top 1000 pages list, which provides a list of Wikipedia’s articles by pageviews on a specific date.
The information I’m trying to get from such a list is which pages are RECENTLY popular and why. After all, this list counts overall popularity, not comparative popularity. There are plenty of Wikipedia articles which are on this list often simply because they’re popular, not because they’ve attracted recent interest. I don’t want those.
We thought about comparing list inclusion over time, but doing that on the fly gets tangled in a hurry. So we tried z-scores again: we pulled five days' worth of recent pageviews for each of the top 100 articles, computed a z-score for the list date's pageviews against those recent days, and filtered the list to include only those articles with a z-score over 1.5.
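Here is a minimal sketch of that filter in Python. It assumes the z-score measures how far the list date's pageviews sit from the mean of the five preceding days, in units of their sample standard deviation; that is one reading of the approach, not necessarily the exact calculation we used. The function names, page titles, and view counts are all invented for illustration.

```python
from statistics import mean, stdev


def recent_z_score(baseline_views, list_date_views):
    """Z-score of the list date's views against the preceding days' views.

    Assumption: the baseline is the recent days BEFORE the list date.
    """
    if len(baseline_views) < 2:
        return 0.0  # not enough history to estimate spread
    mu = mean(baseline_views)
    sigma = stdev(baseline_views)  # sample standard deviation
    if sigma == 0:
        return 0.0  # perfectly flat baseline; treated as "no signal" here
    return (list_date_views - mu) / sigma


def filter_recently_popular(pages, threshold=1.5):
    """Keep only pages whose z-score is above the threshold, highest first."""
    kept = []
    for title, (baseline, today) in pages.items():
        z = recent_z_score(baseline, today)
        if z > threshold:
            kept.append((title, z))
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Made-up numbers: one page that is always busy, one that just spiked.
pages = {
    "Evergreen_Page": ([50000, 51000, 49500, 50500, 50200], 50800),
    "Spiking_Page": ([1200, 1100, 1300, 1250, 1150], 48000),
}

for title, z in filter_recently_popular(pages):
    print(f"{title}: z = {z:.1f}")
```

The always-busy page has far more views than the spiking page had a week earlier, but its list-date count is only about one standard deviation above its own baseline, so it drops out. The spiking page is hundreds of standard deviations above its baseline and stays in.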
I was surprised how well that works at filtering out constantly popular pages, because as a filter method it seems kind of basic. We explored using the positive/negative z-score patterns from all five days of views to determine popular pages, but using a single z-score worked better.
Once I had a filtered list of popular pages, I sent the Wikipedia page titles to a News API date-bounded search. The News API lets you search by just title, keywords, and descriptions, so you can do some specific searching in a decent amount of data. And now I’ve got a nice little Wikipedia-based news thing to keep me informed.
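A date-bounded search for one page title might be assembled roughly like this. Everything about the request shape is a placeholder: the base URL, the parameter names (`q`, `search_fields`, `from`, `to`), and the one-day lookback are assumptions made for illustration, not the actual News API's interface, so check the provider's documentation before borrowing any of it.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the real one from the API docs.
PLACEHOLDER_BASE_URL = "https://news.example.invalid/search"


def news_query_params(title, list_date, days_before=1):
    """Build hypothetical search parameters for one Wikipedia page title."""
    # Wikipedia titles use underscores where a headline would use spaces.
    query_text = title.replace("_", " ")
    start = list_date - timedelta(days=days_before)
    return {
        "q": f'"{query_text}"',  # quoted so the title is searched as a phrase
        "search_fields": "title,keywords,description",  # assumed name
        "from": start.isoformat(),  # assumed name
        "to": list_date.isoformat(),  # assumed name
    }


def news_query_url(title, list_date, days_before=1):
    """Join the placeholder base URL and the encoded parameters."""
    params = news_query_params(title, list_date, days_before)
    return f"{PLACEHOLDER_BASE_URL}?{urlencode(params)}"


print(news_query_url("Example_Page_Title", date(2024, 3, 10)))
```

Bounding the search to the list date and the day before is a guess at what "date-bounded" should mean here; a wider window returns more stories but also more that have nothing to do with the spike.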
Choose a date and get a list of (hopefully) new/recent entrants to the top 100 most popular Wikipedia pages for that date, along with relevant headlines and articles about each topic from that day. Click on a headline and you get additional information, including keywords and the source of the article.
Specify a date and get a roundup of important people, places, and things, along with news stories relevant to both the topic and the date. I like how this game turned out!