I do the same thing in one of the many pieces of my somewhat messy Parse This library. Parse This, which is designed to feed WordPress plugins, forms the basis of the reply-contexts in the Post Kinds plugin, the parsing for the Yarns Microsub plugin, and my newly released bookmarks plugin. In all cases, it tries to extract as much data as it can about the URL sent to it and return it in microformats 2 JSON or the simplified jf2 format.
Jamie’s code is a simple 80 lines that takes a few tags and tries to convert them. I ran through every meta tag I could find by looking at dozens of different sites, so I was inspired to document what I found.
First of all, if you look at MDN’s definition for the meta tag, it states that if the name attribute is set, the meta element applies to the entire page, but if the itemprop attribute is set, that’s user-defined metadata. The content attribute contains the value that goes with the name attribute. There is no mention of the property attribute in the HTML spec, but it is used in the OpenGraph protocol.
I take name, property, or itemprop and map it to the key in an associative array, then content is the value. For keys written as CURIEs (with a : separator), I split on the colon to create a nested array, which is what I use to map properties.
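As a rough illustration of that collection step, here is a minimal Python sketch using only the standard library. It is not the Parse This code itself: the names `MetaCollector`, `insert_nested`, and `parse_meta` are made up for this example, and storing a plain value under an empty-string key when it also has sub-properties (og:image alongside og:image:width) is an assumption of the sketch, as is turning repeated tags into a list.

```python
from html.parser import HTMLParser


def insert_nested(tree, parts, value):
    """Store value at the path in parts, e.g. ['og', 'image', 'width']."""
    node = tree
    for part in parts[:-1]:
        child = node.get(part)
        if not isinstance(child, dict):
            # A plain value already sits here (og:image seen before
            # og:image:width); keep it under "" so nothing is lost.
            child = {} if child is None else {"": child}
            node[part] = child
        node = child
    leaf = parts[-1]
    if isinstance(node.get(leaf), dict):
        # Sub-properties arrived first; the plain value goes under "".
        node, leaf = node[leaf], ""
    existing = node.get(leaf)
    if existing is None:
        node[leaf] = value
    elif isinstance(existing, list):
        existing.append(value)
    else:
        # Repeated tag (several article:tag entries): collect into a list.
        node[leaf] = [existing, value]


class MetaCollector(HTMLParser):
    """Collect <meta> tags into a nested dict keyed on colon-separated names."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # name, property, or itemprop supplies the key; content the value.
        key = attrs.get("name") or attrs.get("property") or attrs.get("itemprop")
        value = attrs.get("content")
        if key and value is not None:
            insert_nested(self.meta, key.split(":"), value)


def parse_meta(html):
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta
```

Feeding it a page head with og:title and two article:tag entries would give something shaped like `{"og": {"title": ...}, "article": {"tag": [..., ...]}}`, which is the kind of structure the mappings below work from.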
There are common classic meta names that are longstanding and defined in the HTML specification, such as author, description, and keywords. If nothing else, this might generate some simple information.
Moving up a level to OpenGraph…there are several common metadata fields, namespaced with og.
- og:title – this would map to p-name
- Media – Some media properties have a :secure_url variant (for example og:image:secure_url) for the https version of the file. This is still used, although the modern utility is sometimes questionable.
- og:image – this would map to u-photo.
- og:video – this would map to u-video
- og:audio – this would map to u-audio
- og:url – this would map to u-url
- og:description – this would map to p-summary
- og:longitude, og:latitude can map to the equivalent location
- og:type – The type is a bit harder to map, but can be used as hinting otherwise. Article as a type would be considered h-entry, profile would be h-card, music and video types would be h-cite.
Of the various types, music and video types are not really represented well in Microformats. So let’s focus on article first.
- article:published_time – mapped to the dt-published property
- article:modified_time – mapped to the dt-updated property
- article:author – mapped to the author property
Many of the types have a tag property that can hold one or more tags, which get mapped to category.
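The og and article mappings listed above can be sketched as a pair of lookup tables. This is a hedged Python illustration, not the Parse This implementation: the function names and the input shape (a nested dict such as `{"og": {...}, "article": {...}}`, with a plain value that also has sub-properties stored under an empty-string key) are assumptions of the example. The output uses jf2 property names, so p-name becomes `name`, u-photo becomes `photo`, and so on.

```python
# og:* property -> jf2 property, following the list above.
OG_TO_JF2 = {
    "title": "name",
    "image": "photo",
    "video": "video",
    "audio": "audio",
    "url": "url",
    "description": "summary",
    "latitude": "latitude",
    "longitude": "longitude",
}

# article:* property -> jf2 property.
ARTICLE_TO_JF2 = {
    "published_time": "published",
    "modified_time": "updated",
    "author": "author",
    "tag": "category",
}


def jf2_type(og_type):
    """Use og:type as a hint for the jf2 type, or None if it says nothing."""
    if not isinstance(og_type, str):
        return None
    if og_type == "article":
        return "entry"
    if og_type == "profile":
        return "card"
    if og_type.startswith(("music", "video")):
        return "cite"
    return None


def leaf_value(value):
    """Return the plain value of a node that may also carry sub-properties."""
    if isinstance(value, dict):
        return value.get("")
    return value


def ogp_to_jf2(meta):
    og = meta.get("og", {})
    jf2 = {}
    kind = jf2_type(leaf_value(og.get("type")))
    if kind:
        jf2["type"] = kind
    sources = ((og, OG_TO_JF2), (meta.get("article", {}), ARTICLE_TO_JF2))
    for source, mapping in sources:
        for key, prop in mapping.items():
            value = leaf_value(source.get(key))
            if value is not None:
                jf2.setdefault(prop, value)
    return jf2
```

Because the type is only a hint, the sketch leaves `type` out entirely when og:type is something like website, rather than guessing.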
Jamie opted to map the Twitter namespace properties as a secondary factor. I opted not to. The namespace is from their Cards specification, which is really just another OGP namespace. The problem is that they don’t provide an author name or website, only their Twitter handle. The majority of sites I viewed had both the og and the twitter namespaces, and I never got anything from the twitter namespace that wasn’t in the og namespace except Twitter-specific details, which I wasn’t interested in. Facebook was responsible for OGP, and most people want to cover both sites, so they include both.
I did opt to look for the custom namespace for FourSquare venues, which is playfoursquare, for latitude and longitude. I also considered the presence of the namespace to indicate a FourSquare venue, and therefore an h-card.
- playfoursquare:location:latitude – maps to p-latitude
- playfoursquare:location:longitude – maps to p-longitude
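A short Python sketch of that venue check, under the assumption that the meta tags have already been collected into a nested dict keyed on the colon-separated parts of each name. The function name `foursquare_to_jf2` is hypothetical, and the coordinates are passed through as the strings found in the page.

```python
def foursquare_to_jf2(meta):
    """Treat any playfoursquare:* tag as a sign the page is a venue (a card)."""
    venue = meta.get("playfoursquare")
    if not isinstance(venue, dict):
        return None
    card = {"type": "card"}
    location = venue.get("location")
    if isinstance(location, dict):
        for key in ("latitude", "longitude"):
            if key in location:
                card[key] = location[key]
    return card
```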
After the OGP tags, I also looked for some other common meta tag names.
Some academic sources use Dublin Core properties in meta tags:
- DC.Creator – p-author
- DC.Title – p-name
- DC.Date – dt-published
- DC.Date.modified – dt-updated
Parse.ly, which is part of WordPress VIP, has its own markup.
- parsely-title – p-name
- parsely-link – u-url
- parsely-image-url – u-photo
- parsely-type – post is h-entry, index would be h-feed
- parsely-pub-date – Publication date
- parsely-author – p-author
- parsely-tags – p-category
- They also offer the property parsely-metadata for other fields which is json encoded.
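The Parse.ly list above can be sketched the same way. This Python example is an illustration under stated assumptions rather than the Parse This code: it takes a flat dict of meta name to content, the function names are invented here, and it assumes parsely-tags arrives as a single comma-separated string. The parsely-metadata value is decoded with `json.loads` and simply skipped if it is not valid JSON.

```python
import json

# Parse.ly meta name -> jf2 property, following the list above.
PARSELY_TO_JF2 = {
    "parsely-title": "name",
    "parsely-link": "url",
    "parsely-image-url": "photo",
    "parsely-pub-date": "published",
    "parsely-author": "author",
}

PARSELY_TYPES = {"post": "entry", "index": "feed"}


def parsely_to_jf2(meta):
    jf2 = {}
    page_type = meta.get("parsely-type")
    if isinstance(page_type, str) and page_type in PARSELY_TYPES:
        jf2["type"] = PARSELY_TYPES[page_type]
    for name, prop in PARSELY_TO_JF2.items():
        if name in meta:
            jf2[prop] = meta[name]
    tags = meta.get("parsely-tags")
    if isinstance(tags, str):
        # Assumption: tags are comma-separated in one content attribute.
        cleaned = [tag.strip() for tag in tags.split(",") if tag.strip()]
        if cleaned:
            jf2["category"] = cleaned
    return jf2


def parsely_extra(meta):
    """Decode the JSON-encoded parsely-metadata field, or return None."""
    raw = meta.get("parsely-metadata")
    if not isinstance(raw, str):
        return None
    try:
        return json.loads(raw)
    except ValueError:
        # Malformed JSON: ignore the field rather than fail the whole parse.
        return None
```

Keeping the extra metadata in a separate function leaves it up to the caller to decide which of those site-specific fields, if any, are worth mapping.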
I also convert JSON-LD to microformats, but that’s another story.
Sounds exciting.