Worked on the report
What more can be said?
« November 2004 | Main | January 2005 »
What more can be said?
After all, the optimisation for variables in queries is not that easy. I am too lazy to explain the problem in detail, but in short, to make it work correctly, I have to move variable declarations around in the AST (let and record selection). This makes everything quite a bit more complicated. Then also, I must admit, I am not working at maximum efficiency these days: it will get better, I hope.
After this extended Christmas week-end, I decided to restart work with a simple problem. I worked on a mechanism to push variables to queries at runtime and a way to optimise it. You will find an example if you continue reading. A first version is working, but I am adding a few refinements so that the selection optimiser detects some tricky situations where the variables in the comparison are hidden behind let statements.
Here is an example of this: to find every book published by every author in a database, you use the following SLinks program:[bag {auth.#name,
[lst pub.#title | ^pub <lst (table "publication" with {#publication_id:int,#author_id:int,#title:string} order [#title:asc] from db), pub.#author_id == auth.#author_id]
} | ^auth <bag (table "author" with {#author_id:int,#name:string} from db)];
This is how it look after the current optimisation phase and in internal representation (slightly simplified):for auth <bag table 'SELECT table_10.author_id AS author_id, table_10.name AS name FROM author AS table_10' from db in
[bag {
#1=let {auth.#name,
#2=(for pub <bag table '(SELECT table_5.author_id AS author_id, table_5.title AS title FROM publication AS table_5)' from db in
(if (pub.#author_id == auth.#author_id) then [bag pub.#title] else [bag])
}];
However, the (pub.#author_id == auth.#author_id) comparison could be pushed into the second SQL query since auth.#author_id is known when the query is executed. This would lead to a new optimised expressions like this:...
#2=(for pub <bag table '(SELECT table_5.author_id AS author_id, table_5.title AS title FROM publication AS table_5 WHERE table_5.author_id = $ table.#author_id $)' from db in
[bag pub.#title]
...
Where the expression between $s in the SQL query is added to it at runtime.
I have a first functioning version of the type-based project optimisation. with re-typing and environment passing. My first tests have been successful, but I'll have to think a little bit more about the behaviour of the optimisation to be fully convinced it works in all cases (it is suspiciously simple). Anyway, I'll do that after Christmas. I have also started thinking about the best way to modify SLinks to use system F instead of the current 'attached types' to store typing information.
Which leads me to a more important point in this post: I wish a happy Christmas to any (potential) reader of this weblog.
I continued working on the type-based SQL projection optimisation. I haven't finished but I believe I know exactly where I am going; and I would like to make a few comments:
If you consider the following example where type-based optimisation would be superior to the current term-based version (as soon as a function is applied to an iteration variable, the current optimisation goes to a default worst-case state where all fields are selected): [bag (fun ^x -> x.#c)(x) | ^x <bag (table "tata" with {#a:int,#b:string,#c:float} from db)]. When running the optimiser, and when the apply 'fun(x)' node is processed, the optimiser needs to look at the type of the function to determine what 'x' needs to be. Unfortunately, the type kept in the AST is a neatly unified type: '{#a:int,#b:string,#c:float} -> ;float'. This doesn't give any information: I know that x can have an a, b and c field because the table provides them, that's not the point. What I want is the type of the function alone ('{#c:'1,'2} -> '1'). This means that I have to re-type this sub-expression every time (or change the type information stored in the AST).
Secondly, In this example, we are happy to have the function directly available. But in this case '[bag extract(x) | ^x <bag (table "tata" with {#a:int,#b:string,#c:float} from db)]' it isn't. If we want to have the type of the 'extract' function, we need to have access to the typing environment (or in the case of a compiled program, to a 'memory' of earlier values). This adds a new dependency to the optimiser.
For now, I am working on a solution with re-typing and environment-passing because it's easier to implement rapidly, and because it shows how to solve the problem (if the type is recalculated or is stored in the AST, it doesn't change the principle), but it is a solution that really isn't pretty. Yesterday's post was speaking about System F. I don't know if this is exactly the right solution for type information in the AST, but something like this will certainly need to be done in the long term.
Philip Wadler raised the possibility to use as 'internal' language a 'Church-style' language such as system F (I believe the reference paper for this is John Reynold's 1974 paper 'Towards a theory of type structure') to allow an easy transformation of expressions and types during optimisation. In such a system types are part of the expressions. Optimisations on such expressions will keep a correct type since the optimisation is done on the expressions and types are part of them.
In my current system though, types are not part of the expression. Instead, once calculated by type inference they are kept for each expression (that is every node of the AST has a type attached to it, which is the type of the expression for which it is the root) as a buffer of what happened during type inference.
I tried looking into a way to merge both things, but didn't come up yet with a solution to change my current system easily. I also played a little with the type-based optimisation itself, but without any breakthrough there either.
To work on the type-driven optimisation system, I need of course correct types for the expressions. However, some of the optimisation modify the type of an expression (not the top level type, of course, but the type of some internal sub-expressions). Here is an example: In '[set x.#a | ^x <bag (table "tata" with {#a:int,#b:string,#c:float} from db)]' the table operator has type '[bag {#a:int,#b:string,#c:float}]'. However, after a selection optimisation, the type of that table operator is '[bag {#a:int]'.
This means that the expression to optimise must be re-typed between two optimisations, which is a severe efficiency drawback. Another more complicated (I don't know if it is reasonably feasible) solution would be to modify not only the expression but also the type during optimisation. But since the SLinks interpreter is a prototype for a language that should ultimately be compiled, this is probably not such a big problem.
With these correct type information even after the optimisation phase, I have been able to add dynamic type checking of queries. That means that when a table operator is executed, the result provided by the DBMS will be tested to see if it is compatible with the expected type (if the interpreter knows how to translate every column into the right SLinks type). That also means that if the system was until now a "type erasure" system, this is no longer the case: the type of the table operator is used during interpretation.
And finally, I have changed the table operator (it's representation, typing and interpretation) to make it more powerful. An example of the new syntax is: 'table "name" where {#a:int,#b:string} unique order [#a:asc,#b:desc] from db'. Notice the simplified types that are now used. The behaviour of 'unique' and 'order' are self-explanatory. Order is optional, the default value being no ordering at all. Notice also that the type does not require the class (bag, set or list) of the collection anymore (only bags were allowed until now). This is now calculated automatically: if neither 'unique' nor 'order' are set, it is a bag, if only 'unique' is there, it is a set, if 'order' is set, it is a list. This is also a simplified solution to the addition of a 'sortby' operator.
As a side note, this points to a slightly non-logical decision in the design of the language. SLinks has 3 classes of collections, but there should really be 4 of them: ordered or unordered with or without duplicates. In SLinks, following the lines of Kleisli, the ordered collection without duplicates is missing.
And with all this, I didn't have time to work on the type-driven optimisation system itself today, but the structure is here, and I will try it tomorrow.
I believe I have corrected the bug now. Philip Wadler proposed another solution to solve it by changing the optimisation rules to use information of types instead of the expressions themselves. I also tried looking into this solution, but haven't solved it yet.
I continued working on correcting the problem described in yesterday's post.
I feel a little bit as if I was playing this game, and I just hit a snake. The bug I was speaking about yesterday actually was a little bit more than a simple bug. There is a real problem with the SQL optimisations: the projection optimisation (that selects only needed columns) currently does not work with queries with joins, and can therefore not be used after the join optimisation. I know the problem comes from the fact that I loose the information of what table a column comes from when I join two tables, but I haven't implemented a satisfactory way to keep it around until when it is needed.
In continued working on the sortby. Unfortunately, I discovered a bug with another optimisation that I had done earlier (an optimisation that removes useless let and record selection operators, and that is used in conjunction with the SQL optimisations). I haven't fixed this one yet.
I continued improving the current optimisations, and started with the addition of a sortby operator that could be optimised to a SQL query.
I spoke with Philip Wadler about where to go next. Here are the ideas (but considering that only two month are remaining, a selection will have to be made):
Otherwise, I have worked on improving/fixing some problems with the current three SQL optimisations.
All right, the join merging optimisation algorithm is now working. When a 'for' operators on a DB table is nested inside the body of another 'for' operator on a DB, the optimiser will transform it into a single for operator with a join in the query.
Limsoon Wong describes in his paper "The Kleisli Approach to Data Transformation and Integration", two conditions that have to be fulfilled for the optimisation to be applied (and my current implementation requires them), but it might be something to think about:
The algorithm for join optimisation itself is more or less complete, I believe. The problem now is that the SQL part of the program is structured in such a way that it does not allow the kind of access to queries needed for this optimisation. I have started improving that, but this means that I still haven't tested my algorithm.
I continued working on the join optimisation. I must admit I was not very efficient at the end of last week (in swiss one would say "j'ai pétouillé"): I should have grasped this optimisation quicker as it wasn't very difficult really. I believe I am back on track now: I understand the algorithm and have a partial implementation. I hope to finish it tomorrow.
I continued working on the join optimisation. In particular, I had to improve the SQL sub-part of SLinks to allow it to do what is needed for this optimisation.
I corrected the bug in the selection push optimisation, and it now works beautifully. I then started working on the third optimisation to migrate joins.
The first version of the selection push algorithm is now up and running. Unfortunately, it works only in some cases: '[set b | ^{#a=^a,#b=^b|^r} <bag (table "tata" with [bag {#a:pre(int),#b:pre(string),#c:pre(float)}] from db), a == 1];' works (it sends 'SELECT tata.a AS a, tata.b AS b FROM tata WHERE (tata.a = 1)' to the DBMS) but '[set x.#b | ^x <bag (table "tata" with [bag {#a:pre(int),#b:pre(string),#c:pre(float)}] from db), x.#a == 1];' doesn't.
This might not seem a very satisfactory result, but I think the problem is easy to fix, and I'm pretty confident the general structure of the optimisation is right. I hope I will be able to start the third optimisation (joins migration) early tomorrow.