Finding the “Main Characters” in Wikipedia Categories

If I gave you a list of twenty people from Wikipedia and told you to list them in order of cultural prominence without consulting an external reference, how would you do it? You’d probably start by identifying people you know. You’d use your knowledge to sort them as best you could. But what about the rest of the people? Maybe you’d try to do some kind of analysis based on general knowledge. Maybe you’d list them randomly or in alphabetical order. Either way, you would have no real knowledge to draw on for that part of the task.

Now, what if I gave you the same list but I also included with each person’s name their average monthly Wikipedia page views? In this case, you’d be able to create a better list. Why? Because you have access to the historical interest of other humans. That’s what page views are: records of human historical interest in a page. Humans who are interested in and knowledgeable about things that you are not. Why not take advantage of those records and see how they can shape your exploration of giant data collections?

WikiCat Main Characters ( https://wikitwister.com/wmc/ ) identifies the humans in a Wikipedia category, divides them into three tiers based on average monthly page views, and then determines each tier’s top three biggest movers by page views for that month. The category’s “Main Characters” are then presented with date-bounded Google Web and Google News searches so you can get external information specific to high-traffic days. Here’s how it works.

(Note: apparently my WordPress theme broke while I was writing this article. I tried to upload a screenshot about 2/3 of the way through the article, which failed. My Web host checked it and said there was a PHP problem, but that switching to the default theme would fix it. Unfortunately that solution breaks everything else. The last bit of the article lacks screenshots and I guess I need to review this theme. Apologies.)

A screenshot of WikiCat Main Characters. A panel of nine information cards is presented Brady Bunch style. The category was "Rappers from North Carolina." People listed include the rapper Mavi, TiaCorine, Rapsody, Malik Turner, and Fred Durst.

Getting Started With WikiCat Main Characters

WikiCat Main Characters starts with a keyword search for Wikipedia categories. Categories with people in them can be found via several kinds of searches: job titles, locations, or the phrase “people from” if you’re exploring a particular location. If you’re trying to focus on contemporary people, try searching for “21st century” plus keywords of interest. You can also add a country or nationality if you want. I’ll try a search for “21st century Canadian.”

A screenshot showing a keyword search of Wikipedia categories for "21st Century Canadian." There are jillions of them, including mentions of lawyers, musicians, people, pianists, lawyers, and drummers.

As you can see, this gets you a lot of results (though WMC limits them to 20). I’m going to add “business” to my search.

Adding business cuts down the results a lot. I think I’ll look at 21st century Canadian businesspeople. Note that each result has a page count with it. These are not exactly right (getting them exactly right would require API calls) but they’re almost right. Categories with lots of people take a while to process; the approximate count is there so you don’t get surprised by accidentally picking a category with 1500 people in it.

Once you choose a category you’ll be asked to choose a month and a year to analyze the page view counts for. This only goes back to January 2017 because page view counts are not available for Wikipedia’s whole history. You can do incomplete months but it doesn’t work as well, so I’ll analyze the page views for June 2024. Once I’ve entered 6 / 2024 I click Find People and Analyze Programs and away I go.

Screenshot of WikiCat Main Characters. We're doing a keyword search for Wikipedia categories and have discovered the 21st Century Canadian Businesspeople category, with roughly 138 people in it.
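
For readers who want to see what a month of page view data looks like at the request level, here is a minimal sketch that builds a request URL for the public Wikimedia Pageviews REST API. Whether WMC uses this exact endpoint is an assumption; the function name `month_pageviews_url` and its defaults (English Wikipedia, all access methods, human traffic only) are illustrative choices and not taken from the tool.

```python
import calendar
from urllib.parse import quote

# Public Wikimedia REST endpoint for per-article page views.
PAGEVIEWS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def month_pageviews_url(title: str, year: int, month: int,
                        project: str = "en.wikipedia") -> str:
    """Build a daily page view request URL covering one calendar month.

    Hypothetical helper: the defaults below are assumptions, not WMC's
    confirmed settings.
    """
    # monthrange returns (weekday of first day, number of days in month).
    last_day = calendar.monthrange(year, month)[1]
    start = f"{year:04d}{month:02d}01"
    end = f"{year:04d}{month:02d}{last_day:02d}"
    # Article titles use underscores and must be URL-encoded.
    article = quote(title.replace(" ", "_"), safe="")
    return (f"{PAGEVIEWS_BASE}/{project}/all-access/user/"
            f"{article}/daily/{start}/{end}")


if __name__ == "__main__":
    print(month_pageviews_url("Glen Sather", 2024, 6))
```

Fetching that URL (with a descriptive User-Agent header, which Wikimedia asks API clients to send) should return one entry per day, which is the raw material for both the monthly total and the busiest-day listings.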

Once you’ve clicked, WMC will pull the pages, determine which pages in the category refer to humans (it’s looking for certain Wikidata properties in the page), and then start analyzing the page views for those humans. This takes a decent number of API calls so long lists of people take a few minutes to run; WMC has rate-limiting to respect the Wikipedia API. The program will keep you updated as it processes.

Screenshot of WikiCat Main Characters in action. The 140 pages in the 21st Century Canadian Businesspeople category have been determined and the program is posting updates in the format of "Processed 1 of 140 pages, processed 11 of 140 pages," etc.
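
The human check and the rate limiting can be sketched roughly as follows. This is an assumption about how such a check might work, not WMC’s actual code: it treats a page as a person when its Wikidata entity has an “instance of” (P31) statement pointing at “human” (Q5), and it spaces out requests with a fixed delay. The names `is_human` and `throttled` and the 0.1 second delay are hypothetical.

```python
import time

HUMAN_QID = "Q5"      # Wikidata item for "human"
INSTANCE_OF = "P31"   # Wikidata property "instance of"


def is_human(entity: dict) -> bool:
    """Return True if a Wikidata entity JSON object is an instance of human.

    Expects the shape returned for one entity by Wikidata's entity data,
    i.e. a dict with a "claims" mapping of property ID to statements.
    """
    for statement in entity.get("claims", {}).get(INSTANCE_OF, []):
        datavalue = statement.get("mainsnak", {}).get("datavalue", {})
        value = datavalue.get("value")
        if isinstance(value, dict) and value.get("id") == HUMAN_QID:
            return True
    return False


def throttled(items, delay_seconds: float = 0.1):
    """Yield items one at a time, pausing between them (simple rate limit)."""
    for index, item in enumerate(items):
        if index:  # no need to wait before the first item
            time.sleep(delay_seconds)
        yield item


if __name__ == "__main__":
    sample = {"claims": {"P31": [
        {"mainsnak": {"datavalue": {"value": {"id": "Q5"}}}}
    ]}}
    for entity in throttled([sample, {}]):
        print(is_human(entity))
```

A fixed pause is the simplest possible throttle; a real tool might prefer backing off when the API signals it is busy.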

Once the program has finished processing, the update notices vanish and are replaced by a listing of the active “main characters” for that month in that category. Since the low-medium-high view tier limits are dynamically generated, the limits are shown at the top of the results along with the final count of humans (in this case 108).

Screenshot of WikiCat Main Characters' result of searching 21st-century Canadian businesspeople in June 2024.
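
One way to picture the tiering step is the sketch below. The article does not say how WMC computes its dynamic tier limits or how it ranks movers, so both choices here are assumptions: people are sorted by average monthly views and cut into three equal-sized groups, and “movers” are ranked by percentage change against that average. The function names and the dictionary keys (`avg_monthly`, `month_views`) are hypothetical.

```python
def split_into_tiers(people: list) -> dict:
    """Split people into low/medium/high tiers by average monthly views.

    Assumption: tiers are equal-sized thirds of the sorted list. The real
    tool generates its limits dynamically by a method not described here.
    """
    ranked = sorted(people, key=lambda p: p["avg_monthly"])
    n = len(ranked)
    low_end, mid_end = n // 3, (2 * n) // 3
    return {
        "low": ranked[:low_end],
        "medium": ranked[low_end:mid_end],
        "high": ranked[mid_end:],
    }


def top_movers(tier: list, count: int = 3) -> list:
    """Return the people in a tier with the largest percentage increase."""
    def pct_change(person: dict) -> float:
        avg = person["avg_monthly"]
        if not avg:
            return 0.0  # avoid dividing by zero for pages with no baseline
        return (person["month_views"] - avg) / avg * 100
    return sorted(tier, key=pct_change, reverse=True)[:count]


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    people = [
        {"name": "A", "avg_monthly": 10, "month_views": 12},
        {"name": "B", "avg_monthly": 20, "month_views": 40},
        {"name": "C", "avg_monthly": 300, "month_views": 300},
        {"name": "D", "avg_monthly": 400, "month_views": 800},
        {"name": "E", "avg_monthly": 5000, "month_views": 5000},
        {"name": "F", "avg_monthly": 6000, "month_views": 4000},
    ]
    for tier_name, members in split_into_tiers(people).items():
        print(tier_name, [p["name"] for p in top_movers(members)])
```

Ranking within tiers rather than across the whole category keeps a famous person’s large absolute numbers from crowding out everyone else.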

Each person’s listing shows the average monthly page views for the person over the course of a year, the page views for the specified month, the percentage difference, and the top three days with highest page views along with date-bounded Google News and Google Web searches for that day.
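
The numbers in a listing can be derived from a month of daily counts plus a yearly baseline. Here is a minimal sketch under stated assumptions: the function name `summarize_listing` and the shape of its inputs are hypothetical, and the daily counts in the usage example are made up.

```python
from datetime import date


def summarize_listing(daily_views: dict, avg_monthly: float) -> dict:
    """Summarize one person's month: total, percent difference, top days.

    daily_views maps a datetime.date to that day's page view count.
    avg_monthly is the person's average monthly views over a year.
    """
    month_total = sum(daily_views.values())
    if avg_monthly:
        pct_difference = (month_total - avg_monthly) / avg_monthly * 100
    else:
        pct_difference = None  # no baseline to compare against
    # Three busiest days, highest count first.
    top_days = sorted(daily_views.items(),
                      key=lambda item: item[1], reverse=True)[:3]
    return {
        "month_total": month_total,
        "pct_difference": pct_difference,
        "top_days": top_days,
    }


if __name__ == "__main__":
    # Made-up daily counts for illustration only.
    views = {
        date(2024, 6, 1): 100,
        date(2024, 6, 2): 300,
        date(2024, 6, 3): 50,
        date(2024, 6, 4): 250,
    }
    print(summarize_listing(views, avg_monthly=350))
```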

This is a good example search because it shows you the many different ways people can end up with a Wikipedia page view spike. Let’s look at three examples: Glen Sather, Shay Mitchell, and Sam Panopoulos.

Glen Sather

Glen Sather is an ice hockey guy and has been for many decades. As you can see, his monthly views in June had a huge increase above average. The busiest day, June 26, has more views than his normal monthly average. With that much of a spike in page views I’d expect it would be pretty easy to determine the reason by clicking on the Google News search for that day. And it was: Mr. Sather retired on June 26th.

Sometimes it’s that simple; there’s a lot of news about someone on a particular day and that leads to a surge of visits to their Wikipedia page. Sometimes a story might evolve over a few days and it takes an extra step to find where it started. Such is the case with Shay Mitchell.

When you look at Ms. Mitchell’s busiest page view days in June, you see that June 22nd is her busiest page view day. But if you look at a Google News search for her name on that day, you won’t find much. The second-busiest day, however, is June 21, indicating that the story might have started there. If you do a Google News search for her on that day, you’ll see a couple of stories referring to an online controversy where Ms. Mitchell referred to her mother as Spanish in an interview, when before she had described her as Filipino. If you do a Web search for June 21, you’ll get a TikTok as your first result, followed by lots of social media responses from platforms including Twitter, Facebook, and YouTube.
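
Stepping back a day like this is easy to script. The sketch below builds single-day Google Web and Google News search URLs using the `tbs=cdr:1,cd_min:...,cd_max:...` and `tbm=nws` URL parameters. Those parameters are a widely used but unofficial convention that Google could change, and whether WMC builds its links this way is an assumption; the function name `google_day_search_url` is hypothetical.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def google_day_search_url(query: str, day: date, news: bool = False) -> str:
    """Build a Google search URL restricted to a single day.

    Assumes Google's unofficial custom date range parameters:
    tbs=cdr:1,cd_min:M/D/YYYY,cd_max:M/D/YYYY, plus tbm=nws for News.
    """
    stamp = f"{day.month}/{day.day}/{day.year}"
    params = {"q": query, "tbs": f"cdr:1,cd_min:{stamp},cd_max:{stamp}"}
    if news:
        params["tbm"] = "nws"
    return "https://www.google.com/search?" + urlencode(params)


if __name__ == "__main__":
    busiest = date(2024, 6, 22)
    day_before = busiest - timedelta(days=1)
    # Check the busiest day first, then the day before it.
    for day in (busiest, day_before):
        print(google_day_search_url("Shay Mitchell", day, news=True))
        print(google_day_search_url("Shay Mitchell", day))
```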

Finally, let’s look at the listing for Sam Panopoulos. Mr. Panopoulos is part of the medium page view tier. Sometimes the medium and low tiers can’t tell you too much because the average page views are very low to start with. If a page averages 10 views a month and gets two extra views, that’s a 20% bump percentage-wise but isn’t evidence of a huge shift in attention. In this case, however, Mr. Panopoulos had a monthly average of 2955 views and a June page view count of 4802. That’s a 62.51% bump. But you won’t find his name in a Google News search for June 13, his day with the highest page views. In fact, if you look at his Wikipedia article you’ll see that he passed away in 2017. Why is his name popping up now?

You can find the answer in a Web search date-restricted to June 13. A post was made on Twitter about Hawaiian Pizza (pizza with pineapple on it) being invented by “colonizers”; a fact-check linking to Mr. Panopoulos’ Wikipedia page pointed out that he was known for the innovation of putting pineapple on pizza and was in fact Greek-Canadian.

If I were able to upload screenshots I would go through an example of using WikiCat Main Characters for historical research, like analyzing the Wikipedia categories of pharmaceutical executives over various months in 2020, but I’ll hold on that for the moment. I already have some updates I want to do for WMC so I’ll save that for the next article. After I get this theme fixed, of course.
