Finding celebrity locations via Twitter — Celebrity Exif data mining

:: UPDATE ::

Nate Beck gave a presentation on this topic at Ignite Seattle 13; here’s the video.

Last week one of my favorite video podcasts, Hak5, had a segment on Exif data mining (Episode 721). In the episode, Rob Fuller (a.k.a. Mubix) described how his images contained unwanted GPS information embedded in the Exif headers.

Mining Exif information from images is nothing new (version 2.1 of the specification is dated June 12, 1998), but most consumers don’t realize what kind of information is attached to their images by default.

In February of this year, Johannes Ullrich over at ISC published a paper titled Twitpic, EXIF and GPS: I Know Where You Did it Last Summer. Johannes explains:

“Modern cell phones frequently include a camera and a GPS. Even if a GPS is not included, cell phone towers can be used to establish the location of the phone. Image formats include special headers that can be used to store this information, so called EXIF tags.”

So after watching the segment, I was curious how widespread the issue is and decided to conduct my own investigation, codenamed Seeker.

Harvesting images

I decided to write a quick Ruby script to harvest images from multiple image sites. The sites that I targeted were Twitpic, Twitgoo, Tweetphoto, and yfrog. Some were easier to harvest than others, mostly because their APIs were easier to access.

After running the script on myself, a few of my friends, and the team over at Hak5, I decided to widen my search to include public figures and celebrities.

Finding celebrities

At first I just scoured the internet looking for verified accounts for celebrities. I found a few sites that have lists of Twitter accounts for celebrities. My favorite one is WeFollow, which is owned by Digg.

So I wrote a script to collect Twitter account names from WeFollow. All I needed to provide the script was a category and a number of pages to scrape.

I edited the results and tossed in a few people that weren’t on WeFollow, and I came up with the following list of 147 celebrities.

Below is a glimpse of the log file while the script is running.

And in about 42 minutes… I had 11,688 photos from 147 Twitter handles. Not every celebrity on the list had images on those services. In fact, I couldn’t find images for 22 of them, which leaves 125 users with at least one photo.

Processing 147 users
Starting twitpic
Finished twitpic -- 985.437336 seconds
Starting yfrog
Finished yfrog -- 980.306637 seconds
Starting tweetphoto
Finished tweetphoto -- 475.657377 seconds
Starting twitgoo
Finished twitgoo -- 85.571865 seconds
Total Time : 2526.98922 seconds
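
For readers curious how a log like this could be produced, here is a minimal sketch. It is not the Seeker code: the timed and run_harvest helpers are hypothetical names, and the harvester lambdas are stand-ins that do no network work.

```ruby
# Time a block with the monotonic clock and return the elapsed seconds.
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Run each harvester in turn, logging in the same shape as the output above.
def run_harvest(users, harvesters, out: $stdout)
  out.puts "Processing #{users.size} users"
  total = timed do
    harvesters.each do |site, harvester|
      out.puts "Starting #{site}"
      elapsed = timed { harvester.call(users) }
      out.puts "Finished #{site} -- #{elapsed} seconds"
    end
  end
  out.puts "Total Time : #{total} seconds"
  total
end

# Hypothetical stand-ins: a real harvester would call the site's API here.
harvesters = {
  "twitpic" => ->(users) { users.map { |user| "#{user}-twitpic" } },
  "yfrog"   => ->(users) { users.map { |user| "#{user}-yfrog" } }
}
run_harvest(%w[hak5darren donttrythis], harvesters)
```

Timing the whole loop separately from each site is why a total can come out slightly larger than the sum of the per-site figures, as it does in the log above.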

Extracting the Exif data

Now that I had 11,688 images, it was time to go through the images and see what kind of gems were hidden in the metadata.

So I wrote yet another Ruby script which goes through each image and dumps all of the Exif data into a text file.
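
One detail worth knowing when reading such a dump: Exif stores latitude and longitude as three rational numbers (degrees, minutes, seconds), with a separate reference tag holding N/S or E/W. The sketch below shows the conversion to decimal degrees that a map needs. The gps_to_decimal helper is a hypothetical name, and the sample coordinates are made up for illustration, not taken from any image in this post.

```ruby
# Convert an Exif-style [degrees, minutes, seconds] triple plus its
# reference letter (N, S, E or W) into signed decimal degrees.
def gps_to_decimal(dms, ref)
  degrees, minutes, seconds = dms.map { |part| Rational(part) }
  decimal = degrees + minutes / 60 + seconds / 3600
  # South and west are negative in decimal notation.
  decimal = -decimal if %w[S W].include?(ref)
  decimal.to_f
end

# Made-up sample values, written as rationals the way Exif stores them.
lat = gps_to_decimal([Rational(47, 1), Rational(36, 1), Rational(2250, 100)], "N")
lon = gps_to_decimal([Rational(122, 1), Rational(19, 1), Rational(5700, 100)], "W")
puts format("%.5f, %.5f", lat, lon)
```

Doing the arithmetic in rationals and converting to a float only at the end avoids accumulating rounding error from the three divisions.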

And the result…

44 users affected
125 users total
GPS count: 878
Total count: 11688
Percentage: 7.51%

Success! 878 images out of 11,688 have GPS data.
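
The percentage in that summary is easy to reproduce from the counts. A minimal sketch follows; the percentage helper is a hypothetical name, and the user-level figure assumes that "affected" means a user with at least one GPS-tagged image.

```ruby
# Counts taken from the summary above.
gps_images        = 878
total_images      = 11_688
affected_users    = 44
users_with_images = 125

# Share of part in whole, as a percentage rounded to two decimals.
def percentage(part, whole)
  (part.to_f / whole * 100).round(2)
end

puts "Images with GPS data: #{percentage(gps_images, total_images)}%"
puts "Users affected:       #{percentage(affected_users, users_with_images)}%"
```

The share of affected users is much higher than the share of affected images, because a single GPS-tagged photo is enough to count a user as affected.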

Visualizing the information

Now by day I’m an Adobe AIR developer, so naturally I decided to write a simple Flex 4 interface to help me visualize the information I collected, codenamed SeekAIR.

I was able to find personal addresses for some of the celebrities, but have opted not to share that information. The images I have chosen to share are public places where the location is obvious.

Since it was Darren’s show that sent me on this investigation… let’s take a look at Darren’s images.

Darren Kitchen (hak5darren)

First I checked Darren’s TwitPic “Places I’ve Been” page to see if he has GPS enabled on purpose…

It doesn’t seem so.

Let’s open the photos up in SeekAIR and see what we can find…

These are just a couple of the images that I found from Darren which had GPS data encoded in them… I also learned that Darren took these photos on his Droid phone because that information was in the Exif data.

Not that Darren Kitchen isn’t a celebrity in my life… but let’s take a look at someone a little more interesting.

Adam Savage (donttrythis)

Another show that I love is MythBusters, so naturally Adam Savage was on my list of people. Again I checked Adam’s “Places I’ve Been” on TwitPic…

Same thing, apparently no images have locations. Again this is misleading, because many of Adam’s photos have GPS data in them… for example:

You may argue that Adam Savage isn’t a celebrity, and I’d have to fight you on that account. But in any case, let’s move on to another example.

Tom Hanks (tomhanks)

Once again to show I have nothing up my sleeve, let’s check Tom’s “Places I’ve Been” page.

Now let’s open the photos in SeekAIR…

This one shows Google Street View of the location where the image was shot.

And this one shows Tom at Pixar.

Once again we have found GPS data hidden within these images. But… GPS data is not the only information that is included in Exif headers.

Britney Spears (britneyspears)

Britney didn’t have any GPS data in her photos, but nonetheless other information can be found in the Exif data.

This one made me laugh…

I mean, we all know that almost all celebrity photos have been Photoshopped, but this photo has the proof embedded right inside of it.

Collection Statistics

Now, you may be asking yourself exactly how many images are affected, so let’s take a look at the statistics.

Breakdown by Device

The following chart is a bit startling. I’m not going to draw any conclusions about it… perhaps the Apple iPhone is the most popular device among celebrities.

The article from ISC has a better chart showing the cross section by device for the general public.

Affected Files By Site

As you can see in the chart below, the majority of images came from TwitPic. It seems to be the most popular image service.

Affected Users By Site

You’ll notice the total number of users on this chart is 214, not 125; this is because some users had pictures on multiple image services and are counted once per service. The blue bar represents the affected users out of the total users for the site.

Where to go from here

So what can we do to protect ourselves going forward? This issue affects everyone, not only celebrities. Consumers should be aware of what information is leaving their mobile devices.

Remove previous images

The thing about Twitter is that tweets drop out of search after a while, so the tweet that corresponds to a particular image may no longer be easy to find. According to the Twitter API Wiki:

“We also restrict the size of the search index by placing a date limit on the updates we allow you to search. This limit is currently around 1.5 weeks but is dynamic and subject to shrink as the number of tweets per day continues to grow.”

Turn off location services for the camera

Thankfully, the latest release of Apple’s iOS 4 can turn off “Location Services” specifically for the camera application.

If you’re not an iPhone user, your device should have similar settings.

Scrub your images before you upload them

Since Seeker was just a weekend project, I haven’t gotten around to this yet… but ZaaLabs will be releasing a free Adobe AIR application to scrub Exif data from images before uploading them to these image services. Stay tuned.
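
In the meantime, here is a rough idea of what scrubbing involves. This is a minimal Ruby sketch, not the ZaaLabs application: strip_exif is a hypothetical helper that walks the JPEG header segments and drops any APP1 segment carrying Exif data. It assumes a well-formed JPEG, and it leaves other metadata blocks (XMP, for example) untouched.

```ruby
SOI         = "\xFF\xD8".b      # start-of-image marker
SOS         = 0xDA              # start-of-scan marker code
APP1        = 0xE1              # application segment 1, where Exif lives
EXIF_HEADER = "Exif\x00\x00".b  # identifier at the start of an Exif payload

# Return a copy of the JPEG bytes with every Exif APP1 segment removed.
def strip_exif(jpeg)
  data = jpeg.b
  raise ArgumentError, "not a JPEG" unless data.start_with?(SOI)

  out = SOI.dup
  pos = 2
  while pos + 4 <= data.bytesize
    raise ArgumentError, "bad marker at byte #{pos}" unless data.getbyte(pos) == 0xFF

    marker = data.getbyte(pos + 1)
    # Compressed image data follows start-of-scan: copy the rest untouched.
    return out << data.byteslice(pos..) if marker == SOS

    # Segment length is big-endian and includes its own two bytes.
    length  = data.byteslice(pos + 2, 2).unpack1("n")
    segment = data.byteslice(pos, 2 + length)
    exif    = marker == APP1 && segment.byteslice(4, 6) == EXIF_HEADER
    out << segment unless exif
    pos += 2 + length
  end
  out << data.byteslice(pos..).to_s
end

# Usage: ruby scrub.rb input.jpg output.jpg (file names are placeholders)
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  File.binwrite(ARGV[1], strip_exif(File.binread(ARGV[0])))
end
```

Because the GPS tags, the device model and the software name all sit inside the same Exif segment, dropping that one segment removes all three. The image data itself is copied byte for byte, so the picture is not recompressed.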

Hey where can I get Seeker?

I will not be releasing the Seeker or SeekAIR code or applications to the public.

A note to the users mentioned in this post

ZaaLabs is willing to assist in identifying and removing affected images… Contact us.
