« December 2004 | Main | February 2005 »

January 31, 2005

Reporting

I continued with the chapter on syntax, that I have more or less finished, and started some parts of the chapter on type inference. It is quite difficult to say exactly how much I have really written since I worked quite a bit on the text that existed already. The number of new pages however is 3.

January 28, 2005

More on syntax

I wrote about 5 more pages on the syntax of the language, and the syntactic sugar transformation rules.

January 27, 2005

Posting nevertheless

Philip thinks that, in order to keep my gloriously uninterrupted string of postings intact, I should write a report also when I work only on my report. So, today I worked on chapter 2: The SLinks language, and wrote about 4 pages on the syntax of the language. I also worked on putting order in my bibliography - I learnt to use BibTeX and started the ordering yesterday - which is horribly time consuming.

I was astonished by how little I wrote today. Then also, experience shows that my writing speed tends to increase constantly while writing such papers. Some kind of intellectual inertia, I suppose.

Currently, the proposed chapter list is the following:

  1. Introduction
  2. The SLinks language
  3. Compiler structure
  4. Automatic type inference
  5. Database access and optimisations
  6. Conclusion

January 26, 2005

The report

I have been working on my report today. From now on, when I work only on the report I won't post anything on this weblog: it doesn't mean I stop working.

January 25, 2005

There, at last

I finished the new selection optimisation I have been working on for a few days now. For my standard test set, it is working as expected (there is one case with joining on the same table where I am not quite sure, I'll have to think about whether it is really correct). Please read on for a little comment about types, this last optimisation and the projection optimisation.

To make things clear, the selection optimisation is the optimisation that pushes conditions into the WHERE part of an SQL query. The projection optimisation reduces the number of columns queried depending on what is really needed in the code by reducing the number of fields in the SELECT part of an SQL query. What I want to show is that the projection optimisation can use types to obtain the information it needs, while the selection optimisation must look at the actual code. I will use a new , hopefully easier to read, syntax for these examples. Please note that this syntax is not implemented yet.

Here is an example for the projection optimisation. Consider this expression:

{ x.a | ^x <- (table "tata" with (a:int,b:string,c:float) from db)}

The type of the body of this comprehension 'x.a' has type '(a:'a,'b)' where ''a' and ''b' are type variables. By using this type, one can easily see that the table will only use the 'a' field of the table, and hence optimise the query. Furthermore, since the type inference mechanism keeps track of the type of all values declared earlier, it will use this knowledge if such values are used in the body (for example for functions).

Now, for the selection optimisation. Consider this expression:

{ b | ^(a=^a,b=^b|^r) <- (table "tata" with (a:int,b:string,c:float) from db), a == 1 }

Which will be transformed into this after syntactic sugar has been removed:

for ^(a=^a,b=^b|^r) = (table "tata" with (a:int,b:string,c:float) from db) in
	if (a == 1) then
		{ b }
	else {}

The information that is needed to determine what can be optimised comes from the syntactic structure of the body expression (the expression after the 'for … in'). A condtion where one of the branches is the empty list can be pushed into the query. The information contained in the type, at any point in the above expression, is useless to detect this kind of optimisations. I do not have a proof of this, but I really believe that type information is not usable in this case. If a reader disagrees, I'd be happy to hear about his views.

A last little note, while writing this post, I noticed that for the projection optimisation, I still had a little bit too much syntactic exploration in my code, and I will correct that tomorrow.

January 24, 2005

Nearly there again

As I should have expected, a completely new code that wasn't tested at all will not work on the first try. I am still working on some bugs of the code. Hopefully I will finish tomorrow, but considering my recent track record on predicting the end of my work, it might not be the case.

January 21, 2005

Nearly there

This post actually spans two days: I decided to work on Saturday to move forward with this optimisation that was really starting to annoy me. Everything is in place now. I have to add a couple of extra helper functions, but this isn't much: the new optimisation for selection is nearly finished. I must say, it looks better than I thought it would. The code is conceptually pretty nice, and much shorter than before. Then also, I have ended up rewriting more or less the entire algorithm, so one would expect it to be better.

January 20, 2005

Still not there

I wasn't able to finish today the new optimisation code for selections. The problem is that I need to pass more things around than I thought and the current structure is not well suited for it: that means that I have to make broader modifications than I hoped to my original code. It will still require a little bit more work.

January 19, 2005

And go, and go ...

I continued working on the enhanced cleaner version of the code that I spoke about yesterday. Nothing much to say. I hope to have finished it tomorrow. After that, I hope to start seriously with the report.

January 18, 2005

Touch and go

I corrected the annoying bug I mentioned yesterday. The code where the bug was hidden was quite ugly though. Furthermore, I believe that by using a more streamlined approach, I would be able to optimise easily a case (with more complex variables in queries) that wasn't optimised with the current solution. This is why as soon as the bug was corrected, I broke the code again and have started rewriting a part of the code to push selections to queries. I believe this should yield to a significantly more powerful, readable and extensible code.

January 17, 2005

Annoyance

The bug I spoke about yesterday is resisting me. I was unable to fix it today. I know exactly where the problem is. But the code where the bug is is complicated and with plenty of dependencies (two optimisations depend on it), so I have to be very careful not to break something else.

January 14, 2005

Fiddling with SLinks

I modified some parts of the optimisation code to make it more readable. Unfortunately, I noticed that by correcting one of Wednesday's bugs, I had added a new problem that prevents one part of the selection optimisation from working. I didn't have time to fix this problem yet. I also started modifying the system for profiling without a lexer and parser, but this isn't ready either.

January 13, 2005

Better sort

I modified the sort operator to work on any collection. It also required a few small changes in the optimiser (when pushing the operator down the syntax tree, the operators in between must be modified to handle lists instead of bags or sets), but I think this concludes my work on this for now. I also tried looking into the profiling of my program. Unfortunately, it seems that the database module is not the only one to use threads (that prevent O'Caml profiling from working) but the lexing module does so too. This is really annoying: I probably will have to write SLinks programs directly in internal representation, and drop the parser entirely for those tests.

I also spent some time trying to understand the "Obj" module for O'Caml, which is a (somehow hacky) method to access the internal representation of data at runtime. I was hoping this would help me to debug my code (not that I need that, of course) by being able to inspect what is going on at runtime. Unfortunately, too much information is discarded at runtime for it to be practical. I got some nice graphs of the run-time structure of my data though.

January 12, 2005

Three bugs

My new test cases for the sort operator pointed to three new bugs in my code. A small one with the sort optimisation, and two more in older code. The latter two where annoying enough to take me the whole day to fix. I was a bit astonished that I didn't see them (or at least one of them) earlier. My brain might have been in deeper hibernation during the Christmas season than I thought. Anyway, it is waking up again since I am starting (hopefully finishing too) to find and fix the left-over bugs.

The sort operator currently only works on lists, which is very limited. Tomorrow, I will modify it to work on bags and sets too. After that, I will try to profile my code for gross inefficiencies (to do this, I have to write a database-less version, because the database modules requires threads that are not supported by the O'Caml profiler).

January 11, 2005

Sorting out the sort optimisation

I have a first implementation of the sort optimisation. I didn't have time to test it thoroughly, so I am not quite sure whether there won't be some hidden problems. But it is too late today to do the testing. I will leave this for tomorrow.

January 10, 2005

Sort operator 3

I fixed a little bug in the interpreter (that Jeremy discovered) with the way the environment is handled when calling functions. I also tried finishing the 'Sort' operator optimisation, but it really requires quite a bit of special cases handling, which is very time-consuming. So it still isn't finished.

January 07, 2005

Other people's work

I went to three talks about query rewriting. the first one was about the properties of the relations between queries, modified queries to be executed on a view of this database, and results. The second one presented a technique to modify queries to handle properly inconsistent data. It works by adding constraint to the query to drop the inconsistent data (useful when the constraints are not part of the DB schema). The third one was officially about 'Rewriting queries using views with access patterns under integrity constraints', but I'm afraid I didn't understand anything.

I also read Jacques Garrigue's paper 'Code reuse through polymorphic variants' (after fighting an hour with the stupid printer to print it). It claims to define a simple solution to the expression problem using O'Caml's polymorphic variants. I believe the solution is fundamentally the same as the solution I had outlined for SLinks (that requires recursive types which are not part of the language right now) in an earlier post.

He does however provide a solution for a problem I did not solve. If you consider this example: 'defrec ^eval2 = fun ^exp -> case exp of <#p={#l=^l,#r=^r}> in (eval2(l))+(eval2(r)) | ^e in eval(e);', the recursion is always done on eval2. That means that if an eval3 function to support extended data is defined, if will not be possible simply to delegate the evaluation of the known cases to eval2 (as eval2 does with eval), because the recursion in eval2 is on the wrong function. To do this, Guarrigue uses open recursion which solves the problem quite nicely (and could, I believe, also be used in SLinks with recursive types).

However, his solution also has the same kind of problems as mine: new functions extending the supported data must have new names (one cannot simply redefine the old function). This means that code using the first version of the function will not support the extended data for free.

January 06, 2005

Sort operator 2

I started the day with an interesting talk by Michael Schwartzbach about 'The design space of type checkers for XML transformations'. But the rest of the day was not quite enough to have the sort operator optimisation working. I will continue working on it tomorrow. But in the meantime, here is a little comment about this operator:

Consider this SLinks expression: 'sort_up [set x.#a | ^x <bag (table "tata" with {#a:int,#b:int} from db)];'. Optimising the 'sort_up' operator into the query would yield something like that: '[lst x.#a | ^x <lst (table "tata" with {#a:int,#b:int} order [#a:asc] from db)];'. The problem that needs to be solved for this optimisation is on what column the ordering must be made, and to do this, there must be a system to inspect the expression of the body of the list comprehension to determine exactly what the list is made of. Using the type of the list is not sufficient (it can easily be seen in the above example).

However, if one considers this expression: 'sort_up [lst x | ^x <bag (table "tata" with {#a:int,#b:int} from db)];', the problem is different: what is the correct ordering? It depends on how the comparison between two records is made for the ordering. Probably the most intuitive solution is to optimise it to '[lst x | ^x <lst (table "tata" with {#a:int,#b:int} order [#a:asc,#b:asc] from db)];'. This implies that the comparison between two records with fields a and b must be implemented as 'if (a1 < a2) then true else (b1 < b2)'. This has two problems: Firstly, it is by no means clear whether this is the right comparison for a particular data type. Secondly (and more importantly), since records have (conceptually at least) unordered fields, this comparison is not well-defined. Or more precisely, it requires to make assumptions about the implementation (that the original ordering of fields is kept).

January 05, 2005

Sort operator

Following yesterday's post about the 'variables in queries' optimisation, It seems that Philip prefers the 'other conceptually rather nicer' solution to the problem. I was pretty much about to change my mind and follow that path anyway because the 'brute force' method was really becoming nasty. However, to implement the moving declaration optimisation nicely, it would be useful to change slightly the representation of the AST. But since the AST might change anyway if a new Church-style internal representation is used, I decided to do something different for now.

This is why I came back to the 'sort' operator (a standalone sort operator, not the sort option for queries), and started working on a way to optimise the sorting to queries. It isn't ready yet, but shouldn't take too long.

January 04, 2005

Moving variables declarations

I have been working on the question of what to do with variable declarations with the 'variables in queries' optimisation. I have found two solutions available to this problem.

The first one is a 'brute force' solution where the optimisation takes explicitly into account the necessity to move those declarations around. I have started implementing this one.

The other is conceptually rather nicer, but might be less efficient (and has a few other problems). It would be to modify the AST to push all variables declarations always to the earliest possible position (where everything needed to calculate the value is known) in a separate optimisation phase. This would solve the problem of the 'variables in queries' optimisation, and might yield to other optimisations as side-effects (pushing the calculation of variables' values outside of loops, for example). On the other hand, it might be counter-productive in some other cases (a value calculated in only one of a condition's branch pushed before the condition, and calculated every time). Furthermore, if side-effects are allowed in expressions, moving let operators (and their value expression) might completely break the expected behaviour of the side-effects. Of course this is also true with the 'brute-force method'.