Thursday 10 June 2010

Cocoon XML Pipelines

I was asked for an introductory description of pipelines and found nothing better than this old Cocoon intro.
Do you know a better one?

The main thing I want to highlight: pipelines are a completely different paradigm for working with XML, although they are built from the same parts - SAX, XSLT.

You can also look at W3C XProc. It has many more features, but I know of no good implementation.

Cocoon is really good and comfortable: it uses Spring well, everything is automated with Maven, and it is very easy to build extensions from existing components. And Cocoon does everything I need.

XR(P)X: my current activity

I'm piloting a system to collect and process a fairly large amount of semi-structured data from XML and HTML sources.
I hope to create and share a Spring 3 wrapper for the Sedna API and Sedna components for Cocoon 3.

// XR(P)X - looks like RSDLP(b) (РСДРП(б)), haha :)

Native XML DB: Sedna

My choice is Sedna. eXist doesn't seem to be efficient enough.

Some motivation:
  • C++
  • Russian Academy of Sciences
  • The authors' list of publications - I respect the algebraic approach.
  • Full transaction support
  • Solaris/BSD/Linux/Win
  • active development, active community

Some links:

Unfortunately I lost the link to a newer performance analysis using XML-like data; as I remember, Sedna was on par with MySQL. Please share it with me if you find it.

Oracle

Yes, the latest Oracle has quite an interesting XML database implementation, with automatic shredding of XML into tables (when a schema is available) using its in-database object support (a kind of ORM, if you like). But you need a schema (not always available), and I can't estimate its XQuery performance. And you also need to give them money :) Actually, Oracle would be very interesting for a database mixing classical relational data and XML, if you need links between them, etc.

One situation where Oracle is perfect: your client is a relational freak with a bunch of money, and you need an XML database and that money.

But in the end I chose Sedna.

My experience

Even for a request that is far from the fastest, a substring search over a big collection, Sedna produces data much faster and with much lower memory usage than Cocoon 3 can parse and process it.

API

Sedna has its own JDBC-like API.
Charles Foster built XQJ and XML:DB wrappers on top of it.

I use Sedna's own API because I want maximum performance. I have started to implement a Spring 3 wrapper around it, following the template approach of the Spring JDBC wrapper. I use it in Cocoon 3 components. Some day I will share it, once I am satisfied with its quality.
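
To give an idea of what "following the Spring JDBC template" could look like, here is a minimal sketch of the template/callback pattern. It is not the actual wrapper and it does not use the Sedna API: `XmlDbConnection`, `XmlDbTemplate` and `FakeConnection` are hypothetical names invented for illustration, and the fake connection returns canned items so the sketch runs without a database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical, minimal view of a transactional XML DB connection.
// This is NOT the Sedna API; a real wrapper would adapt the driver to it.
interface XmlDbConnection {
    void begin();
    List<String> execute(String xquery); // each result item as serialized XML
    void commit();
    void rollback();
    void close();
}

// Template in the spirit of Spring's JdbcTemplate: it owns the
// begin/commit/rollback/close boilerplate, the caller supplies only
// the query and a mapper for each result item.
class XmlDbTemplate {
    private final Supplier<XmlDbConnection> connections;

    XmlDbTemplate(Supplier<XmlDbConnection> connections) {
        this.connections = connections;
    }

    <T> List<T> query(String xquery, Function<String, T> itemMapper) {
        XmlDbConnection con = connections.get();
        try {
            con.begin();
            List<T> mapped = new ArrayList<>();
            for (String item : con.execute(xquery)) {
                mapped.add(itemMapper.apply(item));
            }
            con.commit();
            return mapped;
        } catch (RuntimeException e) {
            con.rollback(); // undo the transaction, then let the caller see the error
            throw e;
        } finally {
            con.close(); // always release the connection
        }
    }
}

// In-memory stand-in that records calls and returns canned items.
class FakeConnection implements XmlDbConnection {
    final List<String> log = new ArrayList<>();
    public void begin() { log.add("begin"); }
    public List<String> execute(String xquery) {
        log.add("execute");
        return Arrays.asList("<title>A</title>", "<title>B</title>");
    }
    public void commit() { log.add("commit"); }
    public void rollback() { log.add("rollback"); }
    public void close() { log.add("close"); }
}

class XmlDbTemplateDemo {
    public static void main(String[] args) {
        FakeConnection con = new FakeConnection();
        XmlDbTemplate template = new XmlDbTemplate(() -> con);
        // Map every returned item to its length, just to show the callback.
        List<Integer> lengths = template.query("doc('books')//title", String::length);
        System.out.println(lengths + " " + con.log);
    }
}
```

The point of the design is that a Cocoon component would only ever write the query and the mapper; transaction handling lives in one place.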

Tuesday 8 June 2010

User interface: XForms on client, pipelines on server

XForms is a really nice standard. Seen from above, an XForms form acts as a web service client (both SOAP and REST): it gets data from a web service, lets the user manipulate it and sends the data back to a web service. So we get a standard browser-based UI acting as a web service client, with no AJAX to program.

Inside, XForms has a clear model-view-controller separation, provides schema-based validation, etc. It is possible, for example, to generate default XForms from XML metadata.

Browser support

XForms is part of the XHTML 2.0 standard, so sooner or later it should be available in any browser out of the box.
Currently only Mozilla provides a good implementation, as an add-on, which is acceptable for intranet applications where you can require users to install it.

For other cases you can transform your XHTML+XForms to XHTML+AJAX using XSLTForms as a temporary solution.

Produce pages with pipelines

Remember the hell of JSP and JSF, taglibs, custom tags, hidden or direct AJAX usage, binding POJOs... Numerous new libraries, their complexity, their learning curves...
And the main problem: JSP pages turn into a history of everything. To separate styling, for example, you need to write your own tags, which is a pain.

I'd rather forget it and build pages with pipelines in Apache Cocoon:
  • Static skeletons of pages with XForms.
  • Fill them with dynamic data (though it would be better to have XForms query the data directly).
  • Finally apply styling and other site-wide functionality to all pages in one place.
  • And you still have the freedom of a Java-based controller if you need it.
  • You can go further, for example generate XForms and pages from metadata.
The main advantage is simplicity and separation of concerns: content management, data access, styling and other features.
A special joy here is that you always have the appropriate tool at hand: manipulate XML data with XSLT, query it with XQuery, implement controller logic in Java. And you can always go one meta-level up if you need to. You simply handle more complexity at a lower development cost.

Architecture: XRX? XR(P)X.

If you check the lists in the previous post, you will notice that, seen from a higher level, an enterprise application turns into a black box producing and consuming XML. If you look inside, you will see XML metadata and transforms from one XML format to another.

The typical way would follow the example of that poor Seam application: WS - XML binding (JAXB) - Java (C#) processing - ORM (Hibernate). And back. And forth. And back... For different formats. If the data is semi-structured, 3 man-years will not be enough...

Landscape

So, do we need to convert XML into a language representation just to make a couple of transformations and send it further? Can we use XML effectively as the format that represents data along its whole path? As an Apache Cocoon user I answer yes, and often more effectively, especially in terms of memory.
And it would be essential to store XML natively, as it is, in an XML database. Native XML databases, both commercial and open source, currently provide amazingly good performance and reliability. Oracle and DB2 also provide effective XML-relational hybrids.
To complete this idyllic XML landscape I will mention XForms, the forms layer of the XHTML 2.0 standard: you describe form presentation, logic and model separately in XML, and the form queries and submits data in XML using REST. No DHTML jiu-jitsu, no AJAX needed from you.

XRX

So: XForms on the client, REST communication with the server, XQuery to the XML database. Dan McCreary named this XRX in his famous post Introducing the XRX Architecture: XForms/REST/XQuery. His article describes the idea very briefly and clearly.
In brief: with an XML DB that has a REST frontend capable of translating REST requests into XQuery, you need nothing else. Here the XML database plays two roles: application server and the database itself.
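
As a rough illustration of "translating REST requests to XQuery", here is a minimal sketch. The URL scheme (`/{collection}/{element}` plus one optional `field=value` filter) and the class `RestToXQuerySketch` are assumptions made up for this example; a real XML database defines its own REST mapping. Names are validated and values escaped, because request data ends up inside a query.

```java
class RestToXQuerySketch {

    // Accept only simple names, so request data can never inject XQuery syntax.
    private static String requireName(String s) {
        if (s == null || !s.matches("[A-Za-z_][A-Za-z0-9_.-]*")) {
            throw new IllegalArgumentException("not a simple name: " + s);
        }
        return s;
    }

    // Escape a value for a single-quoted XQuery string literal:
    // '&' starts an entity reference and a quote is escaped by doubling it.
    private static String escape(String value) {
        return value.replace("&", "&amp;").replace("'", "''");
    }

    // Hypothetical URL scheme: /{collection}/{element}, with an optional
    // field=value filter taken from the query string.
    static String toXQuery(String path, String field, String value) {
        String[] parts = path.split("/");
        if (parts.length != 3 || !parts[0].isEmpty()) {
            throw new IllegalArgumentException("expected /{collection}/{element}: " + path);
        }
        String query = "collection('" + requireName(parts[1]) + "')/" + requireName(parts[2]);
        if (field == null) {
            return query;
        }
        return query + "[" + requireName(field) + " = '" + escape(value) + "']";
    }

    public static void main(String[] args) {
        // GET /books/book?author=O'Brien
        System.out.println(toXQuery("/books/book", "author", "O'Brien"));
    }
}
```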

Check the XRX Wikibook for examples. A typical example of the simplistic XRX approach from the wikibook clearly illustrates the first problem of this over-simplified approach:
  • Layout, forms and data access logic are mixed: there is no separation of concerns at all.
The analogue in the relational world would be the Oracle HTTP package, where you expose stored procedures to the web.

Other problems that appear at large scale:
  • Too much processing is put into the database.
  • How do you query outside services?
  • Security concerns: exposing the database to public HTTP...
  • Integration with the existing ecosystem (security, portals, etc.)
  • Access to JEE components.
  • How do you do other things in Java if needed? A REST controller, for example?

Add pipelines

I would like to put something between the REST frontend and the XQuery interface in XRX. Definitely not JAXB, just a tool to manipulate the XML flow: run XSLTs, make HTTP queries, combine results, run XQueries. In other words, an XML pipeline.

The pipeline concept treats XML processing as a flow through a line of pipes.
XML is initially generated from an HTTP (or other protocol) request or from another source, possibly using request parameters. This can be an XQuery to a DB or just reading a file.
Next the XML goes through some transformers (pipes). Each can modify its content, also using data from other sources. For example, a first XSLT can prepare an XQuery or HTTP request in some XML node, and the next transformer can use it as an instruction to perform the request and put the result back into the XML. Finally the XML is sent to the client or stored in its textual form.
Sequential event processing techniques such as SAX or StAX make pipelines efficient in both memory and CPU terms, while still letting you manipulate XML in a comfortable way.
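
The generator - transformers - serializer chain described above can be sketched with nothing but the JDK's JAXP classes. This is not Cocoon code, only an illustration of the idea: the class name, the two inline stylesheets and the `<items>`/`<page>` vocabulary are made up for the example. Stages hand each other SAX events, so nothing is re-serialized to text between them. Note that a typical XSLT engine still builds its own internal tree of the input inside each XSLT stage, so truly streaming stages need hand-written SAX or StAX transformers.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxPipelineSketch {

    // Hypothetical transformer 1: turn <items><item>..</item></items> into a list.
    static final String TO_LIST =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/items'><ul><xsl:apply-templates select='item'/></ul></xsl:template>"
      + "<xsl:template match='item'><li><xsl:value-of select='.'/></li></xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical transformer 2: site-wide wrapping applied to every page.
    static final String TO_PAGE =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/'><page><xsl:copy-of select='*'/></page></xsl:template>"
      + "</xsl:stylesheet>";

    static String run(String xml) throws Exception {
        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();

        // Two XSLT stages and an identity handler acting as the serializer.
        TransformerHandler first = tf.newTransformerHandler(new StreamSource(new StringReader(TO_LIST)));
        TransformerHandler second = tf.newTransformerHandler(new StreamSource(new StringReader(TO_PAGE)));
        TransformerHandler serializer = tf.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        // Wire the pipes: each stage sends its SAX events to the next one.
        StringWriter out = new StringWriter();
        first.setResult(new SAXResult(second));
        second.setResult(new SAXResult(serializer));
        serializer.setResult(new StreamResult(out));

        // Generator: a SAX parser feeding events into the first stage.
        SAXParserFactory pf = SAXParserFactory.newInstance();
        pf.setNamespaceAware(true);
        XMLReader reader = pf.newSAXParser().getXMLReader();
        reader.setContentHandler(first);
        reader.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<items><item>a</item><item>b</item></items>"));
    }
}
```

In Cocoon the same wiring is declared rather than coded by hand, and the generator can be a database query or an HTTP request instead of a string.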

Although the new XML pipeline standard XProc has appeared, I know of no good implementation to use, and I like Cocoon. The new Cocoon 3.0 gives you the freedom to mix Java and XML in a way I find pleasant, and it addresses all the issues of simplistic XRX well.

So let's add a Pipeline and get XR(P)X.

In the next posts I will look at the individual components I briefly mentioned here.

lazy?

An anecdote: a guy (from whichever social group you like) works at a construction site carrying bricks. One brick per trip :). The manager asks him:
- Why are you so lazy? Look, that guy carries ten bricks at once.
- Huh, I am not, but that guy is lazy indeed - he is trying to avoid extra trips.

In the coming posts I'm going to evaluate how lazy we can be when developing typical enterprise systems ;)

First of all, let me mention some typical real-world situations we face:
  • Heterogeneous web services, often based on legacy standards and very often not fully standards-compliant. Sometimes, to communicate with such a creature, it is reasonable to use the same legacy framework.
  • Producing and exchanging exotic text or XML formats over FTP, HTTP, etc.
  • Parsing and producing XML again and again.
  • Converting received data structures to your own and storing them.
  • Often these structures are not as exact and strict as described.
  • (XML) metadata
And some sweets:
  • Collecting data from web pages with non-valid HTML.
  • Processing semi-structured XML.
  • Being metadata-driven.
What about having them all? One would say it is a straight road to a mental clinic. Well, it depends on the drugs ;)

What tools and methods to use?
"Solid" language-based technology stacks like JEE or .NET fit an ideal world, but the chaos of reality makes development a nightmare, and the codebase becomes fat and ugly.
Lightweight language solutions handle the chaos well by simply following it, until you get lost in the chaos of your own codebase.

I remember a project done with the cutting-edge tools of its time (Seam, Spring, Hibernate, etc.) that did very simple things: fill in some documents by collecting data from web services, sign them digitally, store them and forward them to other web services, track their status. This simple thing turned into 20-30 tables per document, 3 experienced man-years and endless maintenance trouble.

So, unless you are a manager earning revenue from a hundred coders on a state project that no one will really use, you need an escape. Me too.

Keywords would be: XForms, XML, native XML database, XML Pipelines, XQuery, XSLT, XRX?