Many people have written code to solve this problem. I do it in my library, Parse This. Aaron Parecki does it in his XRay library. The Microsub specification has a stricter jf2 output in order to simplify the client having to make all sorts of checks.
This is the point. It is easier to consume a clean and consistent parsed microformats structure. Some of this would probably be solved by some consensus on the matter.
So, what does Parse This, and its ilk do? I lack a name for this sort of code.
- It has two options: feed or single return
- Feed tries to identify and standardize an h-feed. This means if there are multiple top level h- items, it will try to convert it into an h-feed.
- Single will try to identify the top level h- item that matches the URL of the page.
- In both cases, it will run authorship discovery in order to find the representative author and add this as an author property to the h-feed, or single h-entry etc.
- It will try to run post type discovery.