The WordPress Integration Server

As I wrote in Dutch in my last post, one thing I’m thinking about is writing a WordPress integration server, and in particular its architecture. I’m just writing down some thoughts:

  • It should consist of components that can each run as a server themselves, so it can scale further when needed.
  • It should log HTTP requests in an HTTP request database, not only for logging but also as a source of meta information to offer as services.
  • It should have a request queue, possibly stored as a transient, but I’m not quite sure yet how that works out with requests from a weblog that needs something immediately (it can cause lag). I think the service should be smart enough to say either “yes, I have it” or “no, I will get it for you, so please check in later”, where a transient on the requester picks it up later. But that puts a high dependency on the requester. We could also let the integration server “answer” later by issuing an XML-RPC request to the requester or by accessing its database.
  • Let’s say it supports 100 popular social sites to start with. A weblog registers a new Twitter user, which triggers some cron jobs for that Twitter user. Depending on the chosen options, the integration server can then, for example, post new posts with Twitter content from that user to the weblog.
  • Maybe a component will provide some basic mashup services.
  • I’m not sure about the history component. If it keeps all history, there would be no need to store some of that data in the weblog; the weblog can just take a view from the history component in the integration server. No use having duplicate data all over the place.
  • Single sign-on. Possibly it would be the node to talk to for single sign-on requests to other applications. Would be pretty handy.
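To make the “yes, I have it” / “please check in later” idea from the request queue item a bit more concrete, here is a minimal sketch in Python (not WordPress code). The names `IntegrationServer`, `Answer`, `request` and `work` are made up for this illustration, and the `fetch` function is injected so the sketch runs without any network access.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, Optional


@dataclass
class Answer:
    ready: bool                      # True: "yes, I have it"; False: "check in later"
    payload: Optional[bytes] = None  # only set when ready is True


@dataclass
class IntegrationServer:
    # Hypothetical fetcher: takes a URL and returns the fetched bytes.
    fetch: Callable[[str], bytes]
    store: Dict[str, bytes] = field(default_factory=dict)
    queue: Deque[str] = field(default_factory=deque)

    def request(self, url: str) -> Answer:
        """Answer immediately; never fetch while the weblog is waiting."""
        if url in self.store:
            return Answer(ready=True, payload=self.store[url])
        if url not in self.queue:    # do not queue the same URL twice
            self.queue.append(url)
        return Answer(ready=False)

    def work(self, max_jobs: int = 10) -> int:
        """Process queued requests in the background, e.g. from a cron job."""
        done = 0
        while self.queue and done < max_jobs:
            url = self.queue.popleft()
            self.store[url] = self.fetch(url)
            done += 1
        return done
```

The weblog only ever calls `request`, which returns right away; the slow part happens in `work`, which would run on the integration server’s own CPU and memory. Pushing the result back to the weblog (XML-RPC or database access) is left out of this sketch.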

As an anti-requirement: it will run on a WordPress platform, not because it is needed, but because it is pretty cool to do so.

So why do I need it now?

Because I wrote the wp-favicons plugin, which is now at version 0.5.5. After adding version upon version I now have a plugin that works pretty well: it grabs favicons from websites and stores them locally with your weblog. It also has a request cache that grows enormously and caches ALL requests. It also knows about every outgoing link on your weblog as well as its HTTP status (check the source code).

But… as always… I have some problems that I cannot solve by just letting the code evolve further along this path. I realize that when even one request goes out for a favicon, the website “lags” until the favicon is retrieved. I realize that I could use jQuery to trigger a non-blocking cURL call, but think it through a bit further and you will see that this is not the right path: the HTTP requests would still be made from that same server, with that same CPU and that same memory, so you would see the same peaks in your stats. And even answering “no, I do not have that icon yet” takes processing time, to check the databases for the icon and then either answer or add the request to a gigantic queue (after which we have to check both the queue and the database). So… a standalone server that runs on its own cloud instance with its own CPU and memory could a) re-use some code for other integration purposes, b) not bother the weblog itself, and c) hold favicons that can be re-used across multiple weblogs and even non-weblogs, while still keeping that data on one of your own servers.

request cache database

One special component is the request cache database. It simply logs every HTTP request. Superb for knowing exactly what happened, and for further optimizing code without calling the same websites numerous times. Currently, for favicons, that information lives forever. Meaning: if a website returns a 404, it will keep that 404 forever or until the request database is deleted.
The advantage is that you get a clear view of all outgoing links that are no longer correct. Pretty handy and easy at this beta stage. (It also supports HTTP code 418, grin.)


But… obviously we will need to add some kind of transient key to this in the future. For example, for a request to the Twitter API the URL is unique, but the HTTP headers and body that are returned are often only valid for a short period. So maybe the validity period should be a parameter for each kind of request, with the record itself stating how long the information is valid, so that nobody has to dig through all kinds of code looking for the correct time-out values.
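As a rough illustration of a record that states its own validity period, here is a small sketch in Python using SQLite (not the plugin’s actual MySQL schema). The table layout, the `RequestCache` class and the `ttl_seconds` column are assumptions made up for this example; a `ttl_seconds` of NULL stands for the current “lives forever” behaviour.

```python
import sqlite3
import time
from typing import Optional, Tuple


class RequestCache:
    """HTTP request log where each record carries its own validity period."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS requests ("
            " url TEXT PRIMARY KEY,"
            " status INTEGER NOT NULL,"
            " body BLOB,"
            " fetched_at REAL NOT NULL,"
            " ttl_seconds REAL)"  # NULL means: valid forever
        )

    def put(self, url: str, status: int, body: bytes,
            ttl_seconds: Optional[float] = None,
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO requests VALUES (?, ?, ?, ?, ?)",
            (url, status, body, now, ttl_seconds),
        )
        self.db.commit()

    def get(self, url: str,
            now: Optional[float] = None) -> Optional[Tuple[int, bytes]]:
        """Return (status, body), or None if unknown or no longer valid."""
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT status, body, fetched_at, ttl_seconds"
            " FROM requests WHERE url = ?",
            (url,),
        ).fetchone()
        if row is None:
            return None
        status, body, fetched_at, ttl = row
        if ttl is not None and now - fetched_at > ttl:
            return None  # the record says it has expired
        return status, body
```

A favicon 404 could then be stored with `ttl_seconds=None` and an API response with, say, `ttl_seconds=60`; the calling code never needs to know the time-out, because the lookup checks it per record.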

The size of that database is also a problem. In my first experiment I managed to get the data dumped to my localhost, but importing it into my local MySQL database failed because of the size. Really. It can get that big that quickly. Maybe I should make some kind of taxonomy of the data beforehand and distribute the information over multiple databases. I wonder how many servers Google needs to index all the URLs of the world including their content, but I noticed it grows big very fast. This weblog contains only about 31,000 valid outgoing links (and lots of malformed URLs where I made mistakes) and that already made for a big pile of data (also note that a 301 leads to new requests, and a request for an icon is yet another new request) (so yes, it also contains approx. 15,000 favicons in binary format).
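One possible way to distribute the records over multiple databases, purely as a sketch: pick the database from a hash of the URL’s host, so all requests for one website end up together. The function name `shard_for` and the idea of partitioning by host are assumptions for this illustration, not something the plugin does today.

```python
import hashlib
from urllib.parse import urlsplit


def shard_for(url: str, shard_count: int) -> int:
    """Pick a database number from the URL's host, so one site stays together."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer and map it to a shard.
    return int.from_bytes(digest[:8], "big") % shard_count
```

Because the hash is stable, the page request, its 301 follow-ups on the same host and the favicon request all land in the same database. Changing `shard_count` later would move records around, so that number would have to be chosen with some room to grow.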

So to make this thing work for even larger sites, I really need to move in this direction first and then re-implement the favicon plugin and other similar plugins to run on the WordPress integration server.

redirects

Plain calls that return ‘200 OK’ are simple. But the world out there consists of a hell of a lot of returned status codes. And depending on the type of functionality of the specific service/plugin, a status code could mean one kind of action or another. For example, for a favicon request a ‘404’ is not the end, since somewhere in the page an icon tag could be present that holds the reference to the favicon we need. Redirects implemented in JavaScript are also difficult. To build this tool correctly we will need some JavaScript parser that can tell us whether we need to request another page or not. Suppose one of the web services returns an HTML body that contains JavaScript which, depending on the time of day, sends us to some specific page. We really need to be able to parse that JavaScript and then act upon it. So within our integration server we will need a component that is able to parse (the latest version of) JavaScript (and it will probably need some amount of memory to do so).
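The “a 404 is not the end” case can be sketched as follows: when the default favicon request fails, look in the page’s HTML for a `<link>` tag whose `rel` contains `icon`. This is a minimal Python sketch using only the standard library; the names `IconLinkParser` and `find_favicon_url` are made up for this example, and it does not attempt the much harder JavaScript redirect case.

```python
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urljoin


class IconLinkParser(HTMLParser):
    """Collects href values of <link> tags whose rel contains 'icon'."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel_tokens = attr.get("rel", "").lower().split()
        # Matches rel="icon" as well as rel="shortcut icon".
        if "icon" in rel_tokens and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_favicon_url(page_url: str, html: str) -> Optional[str]:
    """Return the absolute URL of the first icon link in the page, if any."""
    parser = IconLinkParser()
    parser.feed(html)
    if parser.hrefs:
        # The href may be relative, so resolve it against the page URL.
        return urljoin(page_url, parser.hrefs[0])
    return None
```

The integration server would call something like this only after the plain `/favicon.ico` request came back with a 404, and would then queue the returned URL as yet another request.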


Coming to you sometime in the coming years, or… if you have some pretty good knowledge of WordPress, PHP, etc… maybe we can join forces to GPL this to the world.

Ꭺ Ꮤeblog Ꮤith ᏔordPress (aweblog.net) | Everything about making a weblog with WordPress!

For beginning webmasters: everything about making a weblog with WordPress.

Quite often people with little (or no) web knowledge come to me asking how they can start their own weblog without the limitations of a SaaS blog provider.

So I went ahead and started this. I will publish a few articles there at a very slow pace, mainly so that I can refer people to them, and really aimed at basic knowledge.

At the same time I am also using it as a test environment for my Swift theme translation.