As I wrote in Dutch in my last post, one thing I’m thinking about is writing a WordPress integration server, and I’m thinking about its architecture. I’m just writing down some thoughts:
- It should consist of components that can each run as a standalone server, so it can scale out even more when needed.
- It should log HTTP requests in an HTTP request database, not only for logging but also as a source of meta information to expose as services.
- It should have a request queue, possibly as a transient. I’m not quite sure yet how that works out with requests from the weblog that need stuff immediately (it can cause lag). I think the service should be smart enough to answer either “yes, I have it” or “no, I will get it for you, so please check back later”, where a transient on the requester picks it up later. But that is a high dependency on the requester. We could also let the integration server “answer” later by issuing either an XML-RPC request/call or database access to the requester.
- Let’s say that it will support 100 popular social sites to start with: so the weblog registers a new Twitter user, which triggers some cron jobs for that Twitter user. Depending on the choice, the integration server can then e.g. post new posts with Twitter content from that user to the weblog.
- Maybe a component will provide some basic mashup services.
- I’m not sure about the history component. If it keeps all history, there would be no need to store some data in the weblog; the weblog could just take a view from the history component in the integration server. No use having duplicate data all over the place.
- Single sign-on. Possibly it would be the node to talk to for single sign-on requests to other applications. That would be pretty handy.
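The “yes I have it” / “please check back later” idea from the queue bullet could be sketched roughly like this. This is a minimal Python sketch of the protocol only (the real thing would be PHP/WordPress with a transient-backed queue); `FaviconService`, its in-memory cache and queue are my own illustration, not plugin code:

```python
from collections import deque

class FaviconService:
    """Sketch of the 'answer now or queue it' protocol."""

    def __init__(self):
        self.cache = {}       # url -> icon bytes; stands in for the request database
        self.queue = deque()  # pending fetch jobs; stands in for the transient queue

    def request(self, url):
        # If we already hold the icon, answer immediately: "yes I have it".
        if url in self.cache:
            return {"status": "hit", "icon": self.cache[url]}
        # Otherwise enqueue the fetch and tell the requester to check back later.
        self.queue.append(url)
        return {"status": "queued"}

    def work(self, fetch):
        # A cron/worker drains the queue later, without blocking the weblog.
        while self.queue:
            url = self.queue.popleft()
            self.cache[url] = fetch(url)

svc = FaviconService()
print(svc.request("http://example.com/"))  # first ask: queued, check back later
svc.work(lambda url: b"ICO")               # worker fetches in the background
print(svc.request("http://example.com/"))  # next check-in: hit
```

The point of the sketch is that the requester never waits on an outgoing HTTP call; it either gets the icon or a “queued” answer and polls again later.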
As an anti-requirement: it will run on a WordPress platform, not because it is needed, but because it is pretty cool to do so.
So why do I need it now?
Because I wrote the wp-favicons plugin, which is now at version 0.5.5. After several releases I now have a plugin that works pretty well: it grabs favicons from websites and stores them locally with your weblog. It also has a request cache that grows enormous and caches ALL requests, and it knows about every outgoing link on your weblog as well as its HTTP status (check the source code).
But… as always… I have some problems that I cannot solve by just letting the code evolve further on this path. I realize that when even one request goes out for a favicon, the website “goes lagging” until the favicon is retrieved. I could fire a non-blocking cURL call via jQuery, but think about it a bit further and you will realize that that path is not the correct one: it will still do HTTP requests from that same server, with that same CPU and that same memory, so you will see the same peaks in your stats. And even answering “no, I do not have that icon yet” takes processing time, to check the databases for the existence of that icon and then either answer or add the request to a gigantic queue (after which we have to check both the queue and the database). So… a standalone server that could run on its own cloud instance, with its own CPU and memory, could a) re-use some code for other integration purposes, b) not bother the weblog itself, and c) hold favicons that can be re-used across multiple weblogs and even non-weblogs, while still keeping that data on one of your own servers.
request cache database
One special component is the request cache database. It simply logs every HTTP request. Superb for knowing exactly what happened, and for further optimizing code without making numerous calls to the same websites. Currently, for favicons, that information lives forever: if a website returns a 404, it will keep that 404 status forever, or until the request database is deleted.
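The log-everything behaviour, including the “a 404 stays a 404 forever” property, can be sketched in a few lines. A Python sketch under my own assumptions (a dict standing in for the request database, `do_request` standing in for the actual HTTP call):

```python
import time

request_cache = {}  # url -> (http_status, body, logged_at); stands in for the request database

def cached_fetch(url, do_request):
    """Log every HTTP request; repeated lookups never hit the network again."""
    if url in request_cache:
        # A 404 recorded here is served from the cache forever,
        # until the request database is deleted.
        return request_cache[url]
    status, body = do_request(url)
    request_cache[url] = (status, body, time.time())
    return request_cache[url]
```

A second call for the same URL returns the stored record without a network round trip, which is exactly why the database doubles as a map of every outgoing link and its HTTP status.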
The advantage is that you get a clear view of all outgoing links that are no longer correct. Pretty handy and easy at this beta stage. (It also supports HTTP code 418, grin.)
But… obviously we will need to add some kind of transient key to this in the future. E.g. a request to the Twitter API etc. makes the URL unique, but the HTTP header and body that are returned are often only valid for a short period. So maybe that should be a parameter for each kind of request, where the record itself states how long the information is valid, to avoid hunting through all kinds of code for the correct time-out values.
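Having each record state its own validity period could look like this. A hedged Python sketch (the field names `logged_at` and `valid_for` are my own invention; a real schema would pick its own names):

```python
import time

def is_valid(record, now=None):
    """The record itself states how long it is valid,
    so callers need no hard-coded time-out values."""
    now = time.time() if now is None else now
    return now - record["logged_at"] < record["valid_for"]

# Hypothetical cached Twitter API response, valid for 300 seconds.
record = {"status": 200, "logged_at": 1000.0, "valid_for": 300}
print(is_valid(record, now=1200.0))  # True: still inside the 300 s window
print(is_valid(record, now=1400.0))  # False: expired, re-fetch and re-log
```

Favicon records could then simply get a very long `valid_for`, while short-lived API responses get a short one, all decided per kind of request.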
The size of that database is also a problem. In my first experiment I managed to get the data dumped to my localhost, but importing it into my local MySQL database failed because of the size. Really. It gets that big that quickly. Maybe I should make some kind of taxonomy of the data beforehand and distribute that information over multiple databases. I wonder how many servers Google needs to index all URLs of the world, including their content, but I noticed it grows big very fast. This weblog contains only about 31.000 valid outgoing links (and lots of malformed URLs where I made mistakes), and that already makes a big pile of data. Also notice that a 301 leads to a new request, and a request for an icon is yet another new request (so yes, it also contains approx. 15.000 favicons in binary format).
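One simple way to distribute the data over multiple databases would be to shard by a hash of the URL. A minimal Python sketch, assuming a fixed shard count (`NUM_SHARDS` is purely illustrative):

```python
import hashlib

NUM_SHARDS = 4  # assumed number of request databases; purely illustrative

def shard_for(url):
    """Map a URL deterministically onto one of several request databases."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same URL always lands in the same shard, so lookups stay cheap.
print(shard_for("http://example.com/favicon.ico"))
```

A taxonomy-based split (e.g. favicons in one database, API responses in another) would work too; hashing just gives an even spread without having to classify anything up front.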
So to make this thing work for even larger sites, I really need to move in this direction first and then re-implement the favicon plugin and other similar plugins to run on the WordPress integration server.
Coming to you sometime in the coming years, or… if you have some pretty good WordPress, PHP, etc. knowledge… maybe we can join forces to GPL this to the world.