Things That Require Further Thinking

September 16, 2016

On death march

With every commit, the question "the right way, or the usual way ?" sounds more and more sarcastic.

September 02, 2016

On improving movie-going experience

Do you know how people sometimes say that there are certain movies you just cannot watch without having a beer or two first ? So why not have alcohol bottled in different sizes and shapes, labeled with the names of actors and/or directors, and handed out according to the movie's cast ? So that if you come to see "Suicide Squad", you'd get yourself a large "Will Smith" beer, a small "Jared Leto" beer, and a shot of "Comix adaptation" vodka.

August 23, 2016

We have lost

I used to know every file on my computer. I used to open every single file in a hex editor to see what's inside. Actually, this is still my first reaction when I want to understand what a file is.

In the programs that I wrote I used to know every line, every character even. Then they got bigger, but I still knew every file. Then they got bigger yet but at least I knew all the dependencies not only by name but by virtue.

Now a static single page web site in React comes with over 100000 files, some of which, as you know, are left pads and hot pockets and there are all new contests. Such complexity is beyond anyone's reach. Even the authors find it overwhelming.

We have lost the understanding of what's happening.

Today's software development is not engineering as such. Given the quality of the outcome it is not even a professional activity. Software development has become a recreational game for the young. The target audience for the programmer became not the users but other programmers. Stackoverflow and Github saw to that. This is now the most active social network in the world.

To impress one's peers it's no longer necessary to build some quality software. Bah ! You can't even tell what quality means. But to ride a bicycle without hands - that's something ! And if you could do it blindfolded ! And backwards !

And so we see thousands of exercises in making things backwards. Without understanding the purpose or the reason, pick up a new tool, play for a month, then move on to a new stimulus. Worse yet if it leaves behind another backwards-backwards-backwards-backwards tool. This adds another layer of useless and not understood complexity and provides positive feedback to the loop.

I remember well one day in 2005, when something out of the ordinary happened. At the time I was working under the supervision of a great software engineer. He was always talking about "architecture", you know. Back then I didn't understand it at all, despite having already worked as a programmer for some 8-9 years. I thought it was all managerial talk. And then I was sitting alone in a conference room, thinking how to organize a UI for some application, and it dawned on me. I knew what architecture was. Not burdened with details, my mind went to the next level; it was almost like I could fly. That I could not forget or unlearn, and I'm happy that this knowledge is with me, because it would not have happened today.

Today I would have just been dabbling in a Sea Of Complexity, pleasing my mind with details. Maybe I would have been happy about that, who knows.

July 15, 2016

On cookie consent

In case you didn't know what a "cookie consent" is, behold - according to EU legislation, web sites must ensure the user accepts their usage of cookies.

And I'm sick and tired of its effects.

It is an outstanding example of what happens when people that don't have the faintest come to control technology, in this case the Internet. For each web site to prompt the user about cookies is a terrible idea.

1. The users don't understand it. They don't know what cookies are or how they work; to them it is meaningless noise. It is not consent, it is just another "some technical message, just click OK" moment. Also, the word "cookie" itself adds a little touch of insanity.

2. The users don't take it seriously. Even when the warning is straightforward (we need cookies to do something you may not like) it is a matter of a single click to close the annoying window.

3. It does not improve privacy. At all. From a privacy standpoint, cookies are not the villains but the most innocent messengers that are being shot.

4. It makes the Internet more stressful. As if we did not have enough banners, one-time offers, subscription popups, landing pages, paywalls and so on, now we have these noisy popups.

5. Technically, cookie consent is a catch-22 situation - to remember whether you have accepted cookies from a site, the site needs to store a cookie. Therefore if you refuse, the site will ask again. Moreover, even if you accept, each browser on each device manages its own cookies, and only a limited number of them. So the questions will continue ad nauseam.

May 01, 2016

Python function guards

TL;DR: Rationale and thoughts behind the implementation of Python 3 function guards.

I really love Python, but unfortunately don't get to use it in my current daily job. So now I have to practice it in my spare time, making something generally useful and hopefully suitable for improving my Python application framework.

1. The idea

I already had a method signature checking decorator written years ago, and it turned out enormously useful, so along the same lines I started thinking about whether it would be possible to implement declarative function guards that select one version out of many to be executed depending on the actual call arguments. In pseudo-Python, I would like to write something like this:

def foo(a, b) when a > b:
  ...

def foo(a, b) when a < b:
  ...

foo(2, 1) # executes the first foo

2. Proof of concept

At first sight it looks impossible, because the second function kind of shadows the first one:

def foo(a, b):
  print("first")

def foo(a, b):
  print("second")

foo(2, 1) # second

but this is not exactly so. Technically, the above piece of code is executed roughly like this (in pseudocode):

new_function = def (a, b): print("first")
local_namespace['foo'] = new_function

new_function = def (a, b): print("second")
local_namespace['foo'] = new_function

and so the problem is not that the function itself is overwritten, but that its identically named reference entry in the current namespace is. If you manage to save the reference in between, nothing stops you from calling it:

def foo(a, b):
  print("first")

old_foo = foo

def foo(a, b):
  if a > b:
    old_foo(a, b)
  elif a < b:
    print("second") 

foo(2, 1) # first
foo(1, 2) # second

so there you have it, what's left is to automate the process and it's done.
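As a rough illustration of what "automating the process" could mean, here is a minimal sketch. The names `when` and `_previous` are hypothetical, and passing the condition as a callable to the decorator is only a shortcut for brevity; it is not the syntax this post ends up choosing. Note also that in this naive chain the newest version is tried first.

```python
_previous = {}  # hypothetical store: function name -> latest version seen so far

def when(predicate):
    """Run the decorated function if predicate(*args) holds, otherwise
    fall back to the previously defined function of the same name."""
    def decorate(func):
        older = _previous.get(func.__name__)  # the reference saved "in between"

        def dispatch(*args):
            if predicate(*args):
                return func(*args)
            if older is None:
                raise LookupError("no version of %s accepts %r" % (func.__name__, args))
            return older(*args)

        _previous[func.__name__] = dispatch
        return dispatch
    return decorate

@when(lambda a, b: a > b)
def foo(a, b):
    return "first"

@when(lambda a, b: a < b)
def foo(a, b):
    return "second"

print(foo(2, 1))  # first
print(foo(1, 2))  # second
```

Calling foo(1, 1) raises LookupError here, because neither predicate holds and there is nothing older to fall back to.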

3. Syntax

There is no question as to how the guard should be attached to the guarded function - it would be done by means of a decorator:

@guard
def foo(): # hey, I'm now being guarded !
  ...

@guard
def foo(): # and so am I
  ...

but the question remains where the guarding expression should appear. I see six ways of doing it:

A) as a parameter to the decorator itself:

@guard("a > b")
def foo(a, b):
  ...

B) as a default value for some predefined parameter:

@guard
def foo(a, b, _when = "a > b"):
  ...

C) as an annotation to some predefined parameter:

@guard
def foo(a, b, _when: "a > b"):
  ...

D) as an annotation to return value:

@guard
def foo(a, b) -> "a > b":
  ...

E) as a docstring

@guard
def foo(a, b):
  "a > b"
  ...

F) as a comment

@guard
def foo(a, b): # when a > b
  ...

Now I will dismiss them one by one until the winner is determined.

Method F (as a comment) is the first to go, because implementing it would require serious parsing and access to the source code, and it would be semantically misleading, as comments are treated as something insignificant that can be omitted or ignored. The rest of the methods at least depend on runtime information only and work on compiled modules.

Method A (as a parameter to the decorator) looks attractive, but is dismissed because it moves the decision from the function to the wrapper. The function alone can't carry a guard expression, and therefore it would not be possible to separate declaration from guarding:

def foo(a, b): # I want to be guarded
  ...
# but it is this guard here that knows how
foo = guard("a > b")(foo)

The rest of the methods are more or less equivalent and the choice is a matter of personal taste. Nevertheless, I discard method E (docstring) because there is just one docstring per function and it has other uses. Besides, to me it looks like it describes the insides of the function, not the outsides.

So the final choice is between having the guarding expression as an annotation and as a default value. The real difference is this: a parameter with a default value can always be put last, but a parameter with an annotation alone cannot:

def foo(a, b = 0, _when: "a > b"): # syntax error
  ...

This and the fact that the aforementioned typecheck decorator already makes use of annotations tip the decision towards the default value:

@guard
def foo(a, b, _when = "a > b"):
  ...

@guard
@typecheck 
def foo(a: int, b: int, _when = "a > b") -> int:
  ...

The choice of a name for the parameter containing the guard expression is arbitrary, but it has to be simple, clear and not conflicting at the same time. "_when" looks like a reasonable choice.

4. Semantics

With a few exceptions, the semantics of a guarded function is straightforward:

@guard 
def foo(a, b, _when = "a > b"):
  ...

@guard 
def foo(a, b, _when = "a < b"):
  ...

foo(2, 1) # executes the first version
foo(1, 2) # executes the second version
foo(1, 1) # throws

Except when there really is a question which version to invoke:

@guard 
def foo(a, b, _when = "a > 0"):
  ...

@guard 
def foo(a, b, _when = "b > 0"):
  ...

foo(2, 1) # now what ?

and if there is a default version, which is the one without the guarding expression:

@guard
def foo(a, b): # default
  ...

@guard
def foo(a, b, _when = "a > b"):
  ...

foo(2, 1) # uh ?

and the way it seems logical to me is this: the expressions are evaluated from top to bottom one by one until the match is found, except for the default version, which is always considered last.

Therefore here is how it should work:

@guard
def foo(a, b):
  print("default")

@guard 
def foo(a, b, _when = "a > 0"):
  print("a > 0") 

@guard 
def foo(a, b, _when = "a > 0 and b > 0"):
  print("never gets to execute")
 
@guard 
def foo(a, b, _when = "b > 0"):
  print("b > 0")

foo(1, 1)   # a > 0
foo(1, -1)  # a > 0
foo(-1, 1)  # b > 0
foo(-1, -1) # default
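
The ordering rule above can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual implementation: the `_chains` dict and the `dispatch` wrapper are hypothetical names, the register is a plain global, and the guard string is evaluated with eval against the bound call arguments.

```python
import functools
import inspect

_chains = {}  # hypothetical register: (module, qualname) -> [(expression or None, function)]

def guard(func):
    # The guard expression is the default value of _when, if the parameter is present.
    when = inspect.signature(func).parameters.get("_when")
    expr = when.default if when is not None else None
    chain = _chains.setdefault((func.__module__, func.__qualname__), [])
    chain.append((expr, func))

    @functools.wraps(func)
    def dispatch(*args, **kwargs):
        guarded = [v for v in chain if v[0] is not None]  # in declaration order
        default = [v for v in chain if v[0] is None]      # always considered last
        for expr, version in guarded + default:
            bound = inspect.signature(version).bind(*args, **kwargs)
            bound.apply_defaults()
            if expr is None or eval(expr, {}, dict(bound.arguments)):
                return version(*args, **kwargs)
        raise LookupError("no version of %s matches the arguments" % func.__qualname__)
    return dispatch

@guard
def foo(a, b):
    return "default"

@guard
def foo(a, b, _when = "a > 0"):
    return "a > 0"

@guard
def foo(a, b, _when = "a > 0 and b > 0"):
    return "never gets to execute"

@guard
def foo(a, b, _when = "b > 0"):
    return "b > 0"

print(foo(1, 1), "/", foo(-1, 1), "/", foo(-1, -1))  # a > 0 / b > 0 / default
```

Without a default version, a call that matches no guard ends in LookupError, which corresponds to the "throws" case shown earlier.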

5. Function compatibility

So far we have only seen the case of identical function versions being guarded. But what about functions that have the same name but different signatures ?

@guard
def foo(a):
  ...

@guard
def foo(a, b):
  ...

Should we even consider having these guarded as versions of one function ? In my opinion - no, because it creates an impression of a different concept - function overloading, which is not supported by Python in the first place. Besides, it would be impossible to map the arguments across the versions.

Another question is the behavior of default arguments:

@guard
def foo(a = 1, _when = "a > 0"):
  ...

@guard
def foo(a = -1, _when = "a < 0"):
  ...

Guarding these as one could work, but would be confusing as to which value the argument has upon which call. So this case I also reject.

What about the simplest case of different names for the same positional arguments ?

@guard
def foo(a, b):
  ...

@guard
def foo(b, a):
  ...

Technically, those have identical signatures and can be guarded as one, but this is likely to be another source of confusion, possibly from a mistake, typo or a bad copy/paste.

Therefore the way I implement it is this: all the guarded functions with the same name need to have identical signatures, down to parameter names, order and default values, except for the _when meta-parameter and annotations. The annotations are excused so that the guard decorator can be compatible with the typecheck decorator. So the following is about as far as two compatible versions can diverge:

@guard
@typecheck
def foo(a: int, _when = "isinstance(a, int)", *args, b, **kwargs):
  ...

@guard
@typecheck
def foo(a: str, *args, b, _when = "isinstance(a, str)", **kwargs):
  ...

Note how the _when parameter can be positional as well as keyword. This way it can be always put at the end of the parameter list in the declaration.
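
One way to express this compatibility rule is to compare reduced signatures. The sketch below is an assumption about how such a check could look; `comparable_signature` and `compatible` are hypothetical helper names, and the example functions carry different names only so that they can coexist in one namespace.

```python
import inspect

def comparable_signature(func):
    """Reduce a signature to what has to match: parameter names, kinds and
    default values, in order, without _when and without annotations."""
    return [
        (p.name, p.kind, p.default)
        for p in inspect.signature(func).parameters.values()
        if p.name != "_when"
    ]

def compatible(f, g):
    return comparable_signature(f) == comparable_signature(g)

def int_version(a: int, _when = "isinstance(a, int)", *args, b, **kwargs):
    pass

def str_version(a: str, *args, b, _when = "isinstance(a, str)", **kwargs):
    pass

def ab(a, b):
    pass

def ba(b, a):
    pass

print(compatible(int_version, str_version))  # True
print(compatible(ab, ba))                    # False
```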

6. Function naming

Before we used simple functions, presumably declared at module level. But how about this:

@guard
def foo():
  ...

def bar():
  @guard 
  def foo():
    ...

class C:
  @guard 
  def foo(self):
    ...

those three are obviously not versions of the same function, but they are called foo() so how do we tell them apart ?

In Python 3.3 and later the answer is this: f.__qualname__ contains a fully qualified name of the function, kind of "a path" to it:

foo
bar.<locals>.foo
C.foo

respectively. It doesn't matter much what exactly is in the __qualname__, only that they are different, which is just what we need. Prior to Python 3.3 there is no __qualname__ and we need to fall back to a hacky implementation of qualname.
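
A quick way to see this for yourself; the snippet only reads the standard __qualname__ attribute and nothing in it is specific to the guard implementation.

```python
def foo():
    pass

def bar():
    def foo():
        pass
    return foo

class C:
    def foo(self):
        pass

names = [foo.__qualname__, bar().__qualname__, C.foo.__qualname__]

# When run at module level this prints foo, bar.<locals>.foo and C.foo.
for name in names:
    print(name)
```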

7. Special cases

Lambdas are unnamed functions. Their __qualname__ has <lambda> in it but no own name. They would be impossible to guard:

foo = lambda: ...
foo = guard(foo)

bar = lambda: ...
bar = guard(bar)

because from the guard's point of view they are not "foo" and "bar", but the same "<lambda>".

An interesting glitch allows guarding classmethods and staticmethods. See, classmethod/staticmethod are not regular decorator functions but objects, and therefore cannot be stacked with the guard decorator:

class C:
  @guard # this won't work
  @classmethod
  def foo(cls):
    ...

because classmethod can't be seen through to the original function foo. But it gets interesting when you swap the decorators around:

class C:
  @classmethod
  @guard
  def foo(cls, _when = "..."):
    ...
  @classmethod
  @guard
  def foo(cls, _when = "..."):
    ...

the way it works now is that the guard decorator attaches to the original function foo before it is wrapped with classmethod. Therefore the guarded chain of versions contains only the original functions, not classmethods. But an actual call goes through the classmethod decorator before it gets to guard: classmethod does its argument binding magic, and whichever foo guard selects for execution gets its first argument bound to the class, as expected.
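
The binding order is easy to observe with a stand-in decorator. In the sketch below `spy` is a hypothetical substitute for guard that merely reports the arguments it receives; it shows that the class is already bound by the time the inner wrapper runs.

```python
def spy(func):
    """Hypothetical stand-in for guard: return the arguments it was called with."""
    def wrapper(*args, **kwargs):
        return args, func(*args, **kwargs)
    return wrapper

class C:
    @classmethod
    @spy
    def foo(cls):
        return "called on " + cls.__name__

class D(C):
    pass

args, result = C.foo()
print(args == (C,), result)  # True called on C
print(D.foo()[1])            # called on D
```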

8. The register

Here is one final question: when a guarded function is encountered:

@guard
def foo(...):
  ...

where should the decorator look for previously declared versions of foo() ? There must exist some global state that maps function names to their previous implementations.

The most obvious solution is to attach a state dict to the guard decorator itself. The dict would then map (module_name, function_name) tuples to lists of previous function versions. This approach certainly works but has a downside, especially considering I'm going to use it with the Pythomnic3k framework. The reason is that in Pythomnic3k modules are reloaded automatically whenever the source files containing them change. Having a separate global structure holding references to expired modules would be bad, but having a chain of function versions span different identically named modules from the past would be a disaster.

There is a better solution: make the register slightly less global and attach the state dict to the module in which a function is encountered. This dict would map just function names to the lists of versions. Then all the information about the module's guarded functions disappears with the module with no additional effort.
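
A minimal sketch of a module-attached register, assuming the state is kept under a hypothetical `__guarded__` name in the defining module's namespace, reached through func.__globals__. The reload is simulated by executing the same source in two fresh module objects; this is not how Pythomnic3k reloads modules, just a way to show that each module gets its own chains.

```python
import types

def register_version(func):
    # func.__globals__ is the namespace of the module that defines func.
    register = func.__globals__.setdefault("__guarded__", {})
    register.setdefault(func.__qualname__, []).append(func)
    return func

source = """
@register_version
def foo():
    return 1

@register_version
def foo():
    return 2
"""

def load(name):
    module = types.ModuleType(name)
    module.register_version = register_version
    exec(source, module.__dict__)
    return module

old = load("m")
new = load("m")  # simulated reload under the same module name

print(len(old.__guarded__["foo"]), len(new.__guarded__["foo"]))  # 2 2
print(old.__guarded__ is new.__guarded__)                         # False
```

When the old module object is dropped, its `__guarded__` dict goes with it, so the versions from the two loads never mix.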

9. Conclusion

The implementation works.

I'm integrating it with Pythomnic3k framework so that all public method functions are instrumented with it automatically, although it is tricky, because when you have a text of just a

def foo(...):
  ...
def foo(...):
  ...

and you need to turn it into

@guard
@typecheck
def foo(...):
  ...
@guard
@typecheck
def foo(...):
  ...

it requires modification of the parsed syntax tree. I will have to write a follow-up post on that.
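
Until that follow-up exists, here is a rough sketch of the general idea using the standard ast module. It is not the Pythomnic3k code: the `AddDecorators` class is hypothetical, the `guard` and `typecheck` functions are stubs that only record their application, and the transformer instruments every def it finds rather than just the public ones.

```python
import ast

source = """
def foo(a):
    return a + 1
def bar(a):
    return a * 2
"""

class AddDecorators(ast.NodeTransformer):
    """Prepend @guard and @typecheck to every function definition."""
    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        extra = [ast.Name(id=name, ctx=ast.Load()) for name in ("guard", "typecheck")]
        node.decorator_list = extra + node.decorator_list
        return node

tree = AddDecorators().visit(ast.parse(source))
ast.fix_missing_locations(tree)  # the new nodes need line/column information

applied = []

def guard(func):      # stub, records that it was applied
    applied.append(("guard", func.__name__))
    return func

def typecheck(func):  # stub, records that it was applied
    applied.append(("typecheck", func.__name__))
    return func

namespace = {"guard": guard, "typecheck": typecheck}
exec(compile(tree, "<instrumented>", "exec"), namespace)

print(applied[:2])  # [('typecheck', 'foo'), ('guard', 'foo')]
```

Decorators are applied bottom-up, which is why typecheck is recorded before guard for each function.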

That's all and thanks for reading.

April 18, 2016

Rent-a-battery ?

Why is there a service where you can instantly rent a car but no such service where you can instantly rent a precharged cell phone battery ? Every time your cell phone battery runs out it is a huge pain to find a charger and/or socket to plug it in. Wouldn't it be nice to have an over-the-counter rent-a-battery at every cashier ?

November 27, 2015

On Emoji

I'm having a hard time grasping this originally Japanese thing. The word rhymes with "emotion", and although in Japanese it had no such meaning, it apparently was intended to express a notion, feeling or emotion in a single hieroglyphic depicting some character or other. It just so happened that after some time it became a (Unicode) standard alternative to smileys and other pictographs.

Speaking of smileys, I may have the emotional range of a teaspoon, but I can't tell what most of those emoji faces mean. Each time I pull up the emoji palette in an application, I'm always stuck at which to pick, despite the seemingly wide choice. They don't convey any emotion I can possibly want to express. Don't get me wrong, they may be perfectly suited to express a notion of a pile of shit but this is not what I need from an emotion. And even with faces, wtf ?

For the purpose of illustration I've picked a few, but you can imagine the rest. Here, see for yourself, and mind that it is an international standard, no less.

So I believe the entire emoji thing was a designer's experiment, a hip toy, which was backed by the Unicode consortium in order to keep filling their seemingly endless code pages, and from there application developers picked it as a "standard" way smileys should be done. Quite unfortunate really.

Please, PLEASE, use something like Kolobok, or hire a designer and at least make your smileys look like Skype's.

Монти Механик

That's how it goes: after listening to a hundred random bands, you stumble upon something outstanding. Монти Механик is a band from Chelyabinsk (which is also nice - they are neighbors), playing music that sits so well with the soul.

July 07, 2015

Helium shoelaces

Shoelaces should really be lighter than air, so that they float up and you never step on them. Inflatable with helium most likely.

April 09, 2015

Rethinking the cache

Caching is a very difficult problem. Not just invalidation and not because of some inherent algorithmic complexity, but because of the complex semantics of the process and its purpose of increasing performance.

In one of our products I had once written a specific application mini-ORM for caching objects persisted in a database. It was a Python 3 application in Pythomnic3k framework and a companion PostgreSQL database.

The database structure was really simple - one table per class named simply like "users" or "certificates" and a handful of stored procedures named like "users__load" or "certificates__save". The tables contained the actual data and a few special fields, like "checksum", which made it possible to detect concurrent update conflicts optimistically.

So each time there was an ORM access in the code

user = pmnc.db_cache.User(user_id = 123)
user.passport_number = "654321"
user.save()

what happened behind the scenes was that the ORM implementation in db_cache.py executed

SELECT *, checksum FROM users__load(123)

then created an instance of class User based on returned data, one property per database column, cached the instance and returned it to the caller. After the passport_number property was modified the call to save executed

SELECT new_checksum
FROM users__save(123, ..., '654321', ..., checksum)

to flush changes to the database. Should another request for user 123 arrive in the meantime

user2 = pmnc.db_cache.User(user_id = 123)

it would have returned the cached instance and not gone to the database.

Like any ORM, this one did not answer all the questions. It did not allow ad-hoc queries which did not map directly to the object paradigm, and it was impractical to create a separate method for every request. Therefore, ad-hoc queries started finding their way directly into the application code like

pmnc.transaction.db.execute(
"""
  SELECT u1.last_name, u2.last_name
  FROM users u1 INNER JOIN users u2
  ON u1.passport_number = u2.passport_number
  WHERE u1.suspended = {suspended}
""",
suspended = True)

So now there were requests that bypassed the ORM cache and went straight to the database. This was still normal as long as they were read-only. But later there were more of them and they were getting more analytical and heavy (think history of changes of all objects belonging to a particular user), therefore the question of caching appeared again.

It was then that I implemented and released a universal read caching mechanism for Pythomnic3k (version 1.4), so that it was possible to enable cache on any resource, database being the most probable candidate of course. I wanted to put it to production and make all the requests including ORM's go through the resource cache. The first thing that surfaced immediately was that read caching implementation was pretty much useless as it was, because while caching reads it did not take into account the existence of concurrent writes.

Actually, I knew this before the release, but simply did not have enough understanding of how such concurrent cache activity should behave. But I left a couple of callable hooks that made it possible to customize the cache behavior at the application level. So I hooked into the cache and made reads and writes coexist. Because the cache couldn't tell whether such and such SQL request had side effects, the fact had to be declared by the application with each call. In simple terms a read like this

pmnc.transaction.cached_db.execute(
"""
  SELECT *, checksum FROM users__load({user_id})
""",
user_id = 123,
pool__cache_read_keys = { "users/123" })

would later have its cached result invalidated by a conflicting write

pmnc.transaction.cached_db.execute(
"""
  SELECT new_checksum FROM users__save({user_id})
""",
user_id = 123,
pool__cache_write_keys = { "users/123" })

This way it went to production and worked fine for about half a year. Until one fine after-deployment morning it didn't.

There was another thing not taken into account, a race condition between reads and writes. For example, if these two conflicting requests executed concurrently:

pmnc.transaction.cached_db.execute(
"""
  SELECT ... WHERE user_id = 123
""")
pmnc.transaction.cached_db.execute(
"""
  UPDATE ... WHERE user_id = 123
""")

there was a chance that the read would start before the write but end after the write. In this case the write wouldn't invalidate the result of the read simply because there was none at the moment, but the result would still arrive and be cached containing already invalid data. The problem was resolved by patching in an industry-standard sleep() in the right place, and it indeed remedied the situation. But now I started to rethink the entire thing. Clearly, caching semantics needed to be improved.

So I went and made a lot of changes to the cache code, using the new experience and focusing on the behaviour of concurrent reads and writes. In particular, the above race condition was fixed by registering affected cache keys before any request, read or write, is sent to the database. This way if a write arrives when a conflicting read is in progress, or the other way around, both are allowed to proceed but the read result is not cached when the read returns. The result is still returned to the caller and it may or may not be invalid, but now it is the responsibility of the database, not the cache, so we did not break the semantics.
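
To make the key registration idea concrete, here is a small sketch. It is not the Pythomnic3k cache: the class `KeyedReadCache` and all of its method names are hypothetical, keys are plain sets of strings, and the database call itself is left to the caller between the begin and end calls.

```python
import threading

class KeyedReadCache:
    """Hypothetical sketch: keys are registered before a request is sent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}  # request -> (read keys, result)
        self._reads = {}    # token -> [read keys, may the result be cached ?]
        self._writes = {}   # token -> write keys

    def begin_read(self, keys):
        token = object()
        with self._lock:
            # A read that starts during a conflicting write is not cacheable.
            clean = not any(keys & written for written in self._writes.values())
            self._reads[token] = [keys, clean]
        return token

    def end_read(self, token, request, result):
        with self._lock:
            keys, clean = self._reads.pop(token)
            if clean:
                self._results[request] = (keys, result)
        return result  # the caller gets the result either way

    def begin_write(self, keys):
        token = object()
        with self._lock:
            self._writes[token] = keys
            for read in self._reads.values():  # taint conflicting reads in progress
                if read[0] & keys:
                    read[1] = False
            stale = [r for r, (k, _) in self._results.items() if k & keys]
            for request in stale:              # invalidate what is already cached
                del self._results[request]
        return token

    def end_write(self, token):
        with self._lock:
            del self._writes[token]

    def get(self, request):
        with self._lock:
            entry = self._results.get(request)
            return entry[1] if entry is not None else None

cache = KeyedReadCache()
read = cache.begin_read({"users/123"})    # the read starts first
write = cache.begin_write({"users/123"})  # a conflicting write overtakes it
cache.end_write(write)
cache.end_read(read, "load 123", "possibly stale row")
print(cache.get("load 123"))  # None
```

The late read still hands its result to the caller, but nothing is stored, so the next read goes to the database and is cached normally.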

Now, as I was overhauling the cache anyway, I also wanted to examine the evidence.

First, I picked up what log files with enough debug information we had from production installations of our product and read through the registered SQL queries. Predictably enough, they fell into two categories:

Ad-hoc queries. Fewer but slower.
ORM queries. A lot more numerous but also much faster.

Second, I checked the cache settings on database connections and they were not right. The cache sizes were too small and the "weight" eviction policy exclusively favored the requests of the first type, which produced fewer hits. By weight we mean the time it took to originally produce the actual value from the database.

So I thought it would be nice to improve the cache by accounting not only for the weight of the cached result, but also for its "usefulness", which would be weight multiplied by hit count. And so I added such an eviction policy, only it was called "useless", as in "evict useless entries first".
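
In code, the "useless" policy could look roughly like this. The `Entry` class, the `evict_useless` function and the sample numbers are all made up for illustration; only the formula, weight multiplied by hit count, comes from the text.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    weight: float  # seconds it originally took to produce the value
    hits: int = 0

    @property
    def usefulness(self):
        return self.weight * self.hits

def evict_useless(entries, max_size):
    """Remove the least useful entries until at most max_size remain."""
    excess = max(0, len(entries) - max_size)
    victims = sorted(entries, key=lambda key: entries[key].usefulness)[:excess]
    for key in victims:
        del entries[key]
    return victims

entries = {
    "history report": Entry(weight=2.0, hits=1),     # slow ad-hoc query, rarely hit
    "users/123":      Entry(weight=0.02, hits=400),  # fast ORM query, hit a lot
    "users/456":      Entry(weight=0.02, hits=2),    # fast ORM query, rarely hit
}

print(evict_useless(entries, max_size=2))  # ['users/456']
```

Under the plain "weight" policy both ORM entries would go before the report; here the frequently hit one survives.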

Some time later I thought that since ORM produces a lot of identical parametrized queries:

SELECT * FROM users__load(?)

it would be reasonable to suggest that each entry cached from any such request has the same average usefulness. For example, if we have 1000 entries, cache hits to which saved 1 second each, and there are 10 new entries that have just been entered and have not had a chance yet, the newcomers should not be evicted right away. It would be better to let them stay, hoping that each will turn out as helpful as the 1000 before them.

Therefore I added an optional "cache group" parameter for a query, the simplest kind of which is the literal SQL string itself. As entries produced by the same SQL string are entered into the cache, they are assigned to the same cache group and have their usefulness accounted for together. Even though the new entry may not have had any hits yet, it is under the umbrella of the high-ranked group with a high average saving.
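
A small sketch of the group accounting, with made-up numbers. The function `group_usefulness` is hypothetical, weights are given in milliseconds to keep the arithmetic exact, and the group is simply the SQL string, as suggested above.

```python
def group_usefulness(entries):
    """entries: key -> (group, weight, hits).
    Returns key -> average usefulness (weight * hits) of the entry's group."""
    totals = {}
    for group, weight, hits in entries.values():
        saved, count = totals.get(group, (0, 0))
        totals[group] = (saved + weight * hits, count + 1)
    return {
        key: totals[group][0] / totals[group][1]
        for key, (group, _, _) in entries.items()
    }

LOAD = "SELECT * FROM users__load(?)"

entries = {
    "users/1": (LOAD, 10, 200),            # saved 2000 ms so far
    "users/2": (LOAD, 10, 100),            # saved 1000 ms so far
    "users/3": (LOAD, 10, 0),              # newcomer, no hits yet
    "report":  ("ad-hoc report", 900, 1),  # saved 900 ms so far
}

ranks = group_usefulness(entries)
print(ranks["users/3"], ranks["report"])  # 1000.0 900.0
```

Judged on its own, the newcomer has zero usefulness and would be evicted first; judged by its group, it outranks the ad-hoc report.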

Eviction now had to work differently. At first I thought that I would simply evict the low-ranked groups first, leaving the high-ranked ones intact if possible. But experiments indicated that one winning group simply took over the entire cache over time. So I had to implement eviction using a weighted average amongst groups, where a high-ranked group has a better multiplier than a low-ranked one. This means that a value from a high-ranked group could still be evicted if it has low weight, and likewise a high weight value from a low-ranked group could stay.

Fiction mode off. I have just committed all this code to Pythomnic3k SVN repository, and it will be in the next version, so anyone who is interested may check it out, the cache should now be usable. Although making it work right in an application may not be obvious, I will later include a sample specifically with cached access to a database.