June 19, 2009

Faulty character decoding as the last line of anti-spam defense

I receive spam every day. Filtering is in place and everything, but occasionally some garbage gets through. Then I may look at it briefly, for less than a second perhaps, before I hit "Delete", but the eye is fast enough to read and understand more than I'd want to. You might say that such a kamikaze message has still succeeded.

Much of the spam I receive is in Russian. As a side note, Russian characters have multiple encodings - WIN1251, KOI8-R, CP866, ISO-8859-5 and the universal UTF-8 come to mind. This means that the mail client has to properly understand the encoding and decode the message so that it can be displayed correctly.
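To make the decoding problem concrete, here is a minimal sketch in Python 3. The sample word is made up for illustration; the codec names koi8_r and cp1251 are the standard library's names for KOI8-R and WIN1251. The same bytes give the original text only when decoded with the encoding they were written in:

```python
text = "Привет"                    # "Hello" in Russian, an arbitrary sample
raw = text.encode("koi8_r")        # the bytes as they would travel in a KOI8-R message

decoded = raw.decode("koi8_r")     # decoded with the encoding the sender used
garbled = raw.decode("cp1251")     # decoded with a wrong guess (WIN1251)

print(decoded == text)   # prints True
print(garbled == text)   # prints False: same bytes, different letters
```

Nothing fails in the second case, which is exactly the trouble: the wrong guess still produces Cyrillic letters, just not the right ones.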

I use Thunderbird, and it is just awful at decoding Russian messages. I have no idea why that is, but I have to manually specify the encoding for every last message, because they always appear garbled.


But then, the bug becomes an unexpected feature - the spam messages look just as undecipherable as the legitimate ones, and even though I look at them, nothing is imprinted in my mind, and I just hit "Delete".

March 30, 2009

Software architecture

is what you explain to somebody else so that he understands the matter.

February 05, 2009

This is Python: context managers and their use

Python allows the developer to override the behavior of pretty much everything. For example, as I explained before, the ability to override the "dot" operator makes all sorts of magic possible.

The topic of this post is similar magic enablers - "context managers", defined in PEP-343. I will also demonstrate one idiosyncratic context manager example.

To begin with, it is important to note that Python reasonably suggests that when a developer modifies the behavior (i.e. the semantics) of something, it is still done somewhat in line with the original syntax. The syntax therefore implies a certain direction in which a particular behavior could be shifted.

For instance, it would be rather awkward if you overrode the dot operator on some class in such a way that it throws an exception upon attribute access:
class Awkward:
    def __getattr__(self, n):
        raise Exception(n)

Awkward().foo # throws Exception("foo")
It is a possible but very unusual way of interpreting the meaning of a "dot", which is originally a lookup of an instance attribute.
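By contrast, an override that stays in line with the syntax still performs some kind of lookup. Here is a minimal sketch, where the Record class and its field names are made up for illustration:

```python
class Record:
    """Hypothetical wrapper that exposes keyword fields as attributes."""

    def __init__(self, **fields):
        self._fields = fields

    def __getattr__(self, name):
        # Called only when the normal attribute lookup has failed.
        try:
            return self._fields[name]
        except KeyError:
            raise AttributeError(name)


r = Record(x=1, y=2)
print(r.x + r.y)        # prints 3
print(hasattr(r, "z"))  # prints False
```

The dot still means "look something up by name", only the place where the value lives has changed.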

Having this in mind we proceed to the context managers. They originate from the typical resource-accessing syntactical pattern:
r = allocate_resource(...)
try:
    r.use()
finally:
    r.deallocate()
Such code is encountered so often that it was indeed a good idea to wrap it into a simpler syntactical primitive. A context manager in Python is an object whose responsibility is to deallocate the resource when it goes out of its scope (or context). The developer should only be concerned with allocating a resource and using it:
with allocated_resource(...) as r:
    r.use(...)
In simple terms, the above translates to:
ctx_mgr = ResourceAllocator(...)
r = ctx_mgr.__enter__()
try:
    r.use()
finally:
    ctx_mgr.__exit__() # simplified: __exit__ really takes the exception type, value and traceback
I note a few obvious things first:
  1. A context manager is any instance that supports the __enter__ and __exit__ methods (aka the context manager protocol).
  2. A specific ResourceAllocator must be defined for a particular kind of resource. The syntactical simplification does not come for free.
  3. Context managers are one-time objects, which are created and disposed of as wrappers around the resource instances they protect.
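The points above can be put together in a small runnable sketch. The Resource class and its log are hypothetical, added only so that the order of events can be observed; ResourceAllocator follows the naming used above:

```python
class Resource:
    """Hypothetical resource that records what happens to it."""

    def __init__(self):
        self.log = ["allocated"]

    def use(self):
        self.log.append("used")

    def deallocate(self):
        self.log.append("deallocated")


class ResourceAllocator:
    """Context manager protecting a single Resource instance."""

    def __enter__(self):
        self._r = Resource()
        return self._r           # this is what "as r" gets bound to

    def __exit__(self, exc_type, exc_value, traceback):
        self._r.deallocate()     # runs whether or not the block raised
        return False             # do not suppress exceptions


with ResourceAllocator() as r:
    r.use()

print(r.log)  # prints ['allocated', 'used', 'deallocated']
```

Note that __exit__ receives the exception details (or three None values if the block finished normally), and that returning a false value lets any exception propagate.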
What is less obvious is that a class can be a context manager for its own instances; there need not be a separate class for that. For example, instances of threading.Lock are their own context managers: they provide the necessary methods and can be used like this:
import threading

lock = threading.Lock()
with lock:
    pass # do something while the lock is acquired
which is identical to
lock = threading.Lock()
lock.acquire()
try:
    pass # do something while the lock is acquired
finally:
    lock.release()
Finally, I proceed to an example of my own.

See, I tend to write a lot of self-tests and I love Python for forcing me to. And some of the tests require that you check for a failure. Long ago I used to write code like this:
try:
    test_specific_failure_condition()
except SpecificError as e:
    assert str(e) == "error message"
else:
    assert False, "should have thrown SpecificError"
which made my test code very noisy. I even posted a suggestion that a syntactical primitive be introduced to the language just for that. It was rejected (duh !).

And then I wrote a simple "expected" context manager which does exactly the same thing for me every day now:
with expected(SpecificError("error message")):
    test_specific_failure_condition()
See how much noise has been eliminated ? How much clearer the test code becomes ? It is not a particularly "resource-protecting" kind of thing, but still in line with the original syntax, just like I said above.

The "expected" context manager source code is available here, please feel free to use it if you like.

To be continued...

December 28, 2008

No sense in sensors

I like buttons. I like handles. I like dials. I like doorknobs. I like doors for that matter. I like physical controls whose shape and feel suggests their usage and whose usage provides physical feedback. If it clicks, budges and moves, then it's good. When it is in expected position and its usage is apparent from its form, then it's good.

Sensor controls make no sense to me. I hate smearing my fingers against a black glossy surface, with unclear outcome. Did it work ? Did I activate the right icon ? I hate it when controls are not really controls, but images on the glass. I hate it when controls change their places, look and functions depending on what I am doing.

Even my stove is black and glossy, with no buttons but tiny engraved white icons. Makes it easy to clean for sure, but using it feels nothing like pressing a button. Oh well, at least the icons are always in the same positions.

iPhone, yes I tried it. Could have spoken through a cigarette case instead. Doesn't feel like a phone at all. Large flat nothing.

Now, why are sensor controls so popular these days then ?

The way I see it, sensor controls are a cheap alternative to good interface design. See, if they knew what this thing would be used for, they could have spent time and money on design and given it a nice interface, specifically for its function.

But there is a problem - they don't know what the thing will be used for. Instead they plan to use it for something no one could imagine at the moment. And they don't want to cast it in stone. They want to leave their options open, so that the interface can be changed later through software update.

From the manufacturer's point of view, the sensor panel is the ideal instrument to implement any interface they may need in the future. It is a way to secure investments, rather than to make the thing more pleasant to use. And the rest is nothing but fashion, done through professional advertising, product placement and the bandwagon effect.

Ideally, people need to be placed in a world with indifferent black walls, with the content dynamically downloadable from the BigCorp site. Virtual reality, that's what it is. Opaque screens instead of windows, so that you can choose a "view". Dumb sensor panels with fake buttons. Smaller packages with more useless contents. Things that you have no control over.

And I hate it. I like real things.

December 08, 2008

Pray for the repose of the best remote banking solution in Russia

Alas, the bank I've been working for over the last five years got the lower hand in a merger. What that means for IT needs no explaining. Everything we made is slowly dying out.

While it's still boiling, perhaps it's time to look back, and think about what's been done.

The good:

1. Still, our remote banking solution has been rated the best in Russia for the last two years in a row, and the forum full of client complaints about its upcoming shutdown is also a good indication of that. And I am honored to have belonged to the team that made it.

2. Pythomnic, the side open source project that I have been developing for years during this employment. Starting early this year I luckily had the time to completely rethink, redesign and rewrite it from scratch in the new Python 3.0. It is a framework for integration in an enterprise network using distributed network services. SOA, EAI, you name it. Essentially, this is what I have been doing for the last five years in the Internetbank project. We have even managed to write a few production services with the new framework and port a few from its previous version. If you don't mind me saying, it is a high quality piece of software, well (re)designed and (re)written. This is the project I will be working on for years to come.

The bad:

1. The recession is getting worse. Not too good having to look for a job at times like these. Mind taking a look at my slightly outdated resume ?

2. I still can't force myself to release software whose quality I consider low. What it means is that I tend to work thoughtfully and thoroughly, but yes, slowly. I could argue both for and against such an approach myself, but not in this post. Anyway, such habits don't play well with modern freelancing. Who needs quality today ?

Therefore, pray for the repose of the wonderful Internetbank project and, if you like, take a look at the Pythomnic3k framework - I hope it is worth your attention.

October 16, 2008

Why do e-mails have a subject ?

Real mail doesn't need a subject, nor headers of any kind really. Could you imagine

From: Leo Tolstoy
To: Anton Chekhov
Subject: Re[2]: War and peace
Date: 16.03.1899

My dear Anton,
...

?

What is the point of e-mail having headers anyway ? Some of them are transport level technical details. For example, the To and From fields serve about the same function as the physical letter envelope with handwritten addresses on it. But subject, what is in a subject ?

It always takes me considerably more time to come up with a sound subject, and it still almost always says nothing about the contents of the letter. What's the point ?

Is it the presumed e-mail volume, so that the user could just look over a long list of subjects without actually opening the messages ? Or is it the limited space on 1970s terminal screens ? Or is it just a technical artefact for the sake of e-mail indexing, storing and referencing ?

Anyhow, right now, neither subject, nor From, nor To fields mean anything.

If a given e-mail is indeed a mail message sent to me, then I care about neither To (which is implicit - me), nor From (which is expected to be politely included in the body), nor subject (which, like I said, is meaningless when written by a well-meaning sender). I simply open the message and read it entirely.

If, on the other hand, the e-mail is a spam, I care about From, To, or subject even less. I just trash it (in fact, my e-mail filter does it for me).

Then, either way, I care only about the contents, not about From, To or Subject. The key problem is really in separating letters from noise. But then From, To and Subject don't help with that either. What's the point in having them ?

August 26, 2008

Google, DNS and finding stuff on the Internet

What if you've encountered the Internet for the first time ? The World Wide Web for that matter. Someone opens a browser for you and says

- This is the Internet, it has everything. Just type in an address of a site you want to visit.

Er, excuse me ? An address of a site I want to visit ??? WTF is that supposed to mean ? Anyone remember the address of the Pyramids ? I wouldn't mind visiting that particular site.

But really, what is a site address ? It is merely a reflection of a technical detail of the physical network organization. It just so happens that for the sake of unambiguous data delivery each computer on the Internet needs its own unique address. Now, the techies that invented it in the 1970s just chose such an address to be an integer number. Had it been up to them, or had the number of connected computers not exploded, numbers could have been used just as well:

- Connect me to server 12345 !
- You got it.
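That an address really is just an integer is easy to see with Python's standard ipaddress module. A small sketch, where the loopback address is picked only because everyone has one:

```python
import ipaddress

# The familiar dotted form is only a way of writing a 32-bit integer.
addr = ipaddress.IPv4Address("127.0.0.1")
print(int(addr))                      # prints 2130706433

# And "server 12345" is a perfectly valid address, too.
print(ipaddress.IPv4Address(12345))   # prints 0.0.48.57
```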

But people are notoriously bad at remembering numbers, and so there emerged a service similar to the yellow pages where each address could be given a name, and conveniently looked up later. Then it went like this:

- The new server is at great.new.site.com

and the user never bothered to translate "great.new.site.com" into 12345. The responsible domain name system (DNS), the ubiquitous service for looking up pieces of information by name, is quite fascinating. It is perhaps the biggest distributed database in the world, and its capabilities have been largely underutilized over the years. Maybe this is why it is still up and running.

The presence of DNS became as important as physical network connectivity. If there is no DNS, the Internet might as well be down. If you care to notice, DNS is exactly where mainstream operating systems have just about their only built-in redundancy. You are actually encouraged to configure multiple DNS servers at once, just in case one dies.

Well, DNS being a nice thing, it still has its own idiosyncrasies. There is really no reason for the site names to be organized in a dot-separated hierarchical fashion. In other words, in

www.yahoo.com

there is no need for either "www" or "com". Yahoo is the name, but the rest is irrelevant. The whole "dot separated" thing and "com" are just technical nuisances which made the development of DNS technically feasible, so that the database could be distributed more effectively. And "www" is nothing but a habit, a meme introduced to the culture. The sounds of "double u, double u, double u" and perhaps the visual rhythm of the letters www immediately prepare anyone familiar with the Internet for the fact that a site address is being transmitted. Synchronization bits if you like.

So, what matters is the "yahoo" part, right ? The name. But the name of what and what's in a name ?

First, I'll go about the "name of what" part. The World Wide Web is de facto a hypertext, a billion files intertwined with mutual links. Accordingly, what you type in is but an entry to the web. Once inside, you neither type nor care to remember any more names or addresses, you just keep following the links. Have you ever stared at a blank browser page trying to invent another name to type in just to see what comes up ? That's the idea. Any name could be tried as an entry gateway, but picking them at random is extremely ineffective. Whenever one has multiple entry points to the web, he has to write them down, which is a starting point for a personal bookmark catalogue, hardly a popular sport any more. Instead it happens that everyone has like ten favorite entry points to the web, the ones that are fashionable, familiar, have catchy names or refer to the person's location or interests. Ok, so each user has his own favorite entry points to the web and they are the only ones that need names.

What's in a name then ? Oh, it is then totally irrelevant what exactly the name is. www.google.com, www.wikipedia.org, www.reddit.com, www.e1.ru, www.kazna.ru whatever is meaningless but catchy or meaningful but easy to remember in connection to some relevant topic.

Google is a catchy name and it presents the richest and the poorest entry page at the same time. See, it might look like it helps, when you type www.google.com and the simplest possible page pops up and says: hi there, just type in what you need. But it is the same question we started from - just type in what you need ! The only difference is that before we had to type the name of a single site, presumably known beforehand. Now we have to try keywords until we find something.

One point here is that the DNS names of the sites are largely irrelevant. A name of a site used to be the single keyword available for finding it, but no more. Now you are far more likely to find a site through the right query to Google.

Another point is that Google and the like perform the same function DNS was supposed to - relieving the user from remembering addresses and looking up relevant sites. The truly distributed DNS, mapping site names to addresses, became part of the physical network (on the right OSI layer if you care), and got replaced by centralized mammoth server farms that map keywords to pages.

Finally, this switch gave enormous power to a proficient user, but for the average user it is still a blank stare at

- This is the Internet, it has everything. Just type in what you want to find.

Er, excuse me ?

August 22, 2008

This is Python, calling a spade a spade

Python is a high level programming language, but what does this term mean ? What does it mean for a language to be high level or low level ? Can you compare height levels of different languages ?

The meaning of the term is nebulous and there is no single or final definition. Here is one approach - the more effectively the language allows you to handle things, the higher level it is. And by things I don't mean just objects as in class instances. Things, you know, everything, even if I occasionally call them objects.

Enter the notion of first-class objects. Put simply, something is called a first-class object in a programming language if it can be treated just like an instance of a primitive type, such as int. For example, when you declare a variable (which is a valuable feature already, to be able to declare a variable of that kind)
int i;
you then can do all sorts of things with it, such as passing it as a parameter:
foo(i);
returning it as the result of a function:
return i;
and do other things, depending on the language. The point is that first-class objects can be handled more effectively and provide additional flexibility. Thus, the more objects in a language are first-class, the higher level that language is.

In Python pretty much everything is first-class. I won't be digging into the language reference to find whether or not it is formally true, but in practice it is just like that. It is partly because Python is a dynamically typed language with reference semantics for variables - as soon as something exists, you should be able to get a reference to it, and then, once you have a reference, you pass it around as a primitive, not caring about the nature of the object it points to. The language itself does not care what kind of an object is being referenced by the variable you pass. It is only when it comes to real work, such as access to the object's methods, that the object may turn out to be incompatible with the operation you throw at it. Such just-in-time type compatibility is a very old idea and is called "protocol compatibility" in Python.
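Here is a minimal sketch of such just-in-time compatibility. The Greeter class and the shout function are made up for illustration; io.StringIO is the standard in-memory file-like object:

```python
import io


class Greeter:
    """Hypothetical class: nothing to do with files, but it has a read() method."""

    def read(self):
        return "hello"


def shout(source):
    # No type check here: anything with a suitable read() method will do.
    return source.read().upper()


print(shout(io.StringIO("from a file")))  # prints FROM A FILE
print(shout(Greeter()))                   # prints HELLO

try:
    shout(42)  # an int has no read(), which is discovered only now
except AttributeError as e:
    print("incompatible:", e)
```

The reference to 42 is passed around just as freely as the other two; the incompatibility shows up only at the point where read() is actually looked up.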

Why is it good ? Because I can call a spade a spade. If I need to pass a class as a parameter, what the heck, I can do it:
def create(c, i):
    return c(i)

create(int, 0)
See ? Generic programming right there.

Or, why wouldn't I be able to pass in a method ?
def apply(f, x):
    return f(x)

def mul_by_2(x):
    return x * 2

print(apply(mul_by_2, 1)) # prints 2
Uhm, was it functional programming ?

One other curious and extremely useful first-class thing, which you wouldn't find in many other languages, is the call arguments. Remember, I have said before that there are no declarations in Python. Compatibility of a called function with the actually supplied arguments is checked just-in-time, just like anything else:
def foo(a):
    ...
foo(1, 2) # this throws at runtime
But nothing stops you from writing a function which accepts any arguments:
def apply(f, *args):
    return [f(arg) for arg in args]

apply(mul_by_2, 1)
apply(mul_by_2, 1, 2)
...
And the point is - inside the apply function args is a variable that references a tuple of the actually passed arguments:
def apply(*args):
    print(args)

apply(1, 2, 3) # prints (1, 2, 3)
There may be just a little stretch in calling args, being "the arguments to the call", a first-class object, but practically it is just that. Imagine the flexibility of things you can do with it.
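One example of that flexibility is a wrapper that forwards whatever it was called with to another function, without knowing anything about its signature. A minimal sketch, with the logged and add names made up for illustration:

```python
def logged(f):
    """Hypothetical wrapper: remembers the arguments of each call and forwards them untouched."""
    calls = []

    def wrapper(*args, **kwargs):
        calls.append((args, kwargs))   # the arguments are just a tuple and a dict
        return f(*args, **kwargs)      # unpack them back into a call

    wrapper.calls = calls
    return wrapper


def add(a, b=0):
    return a + b


add = logged(add)
print(add(1, b=2))  # prints 3
print(add.calls)    # prints [((1,), {'b': 2})]
```

The same wrapper works for any function, because the arguments are handled as ordinary values rather than as something baked into a declaration.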

Anyway, in conclusion I will demonstrate another situation where calling a spade a spade is good. A state machine. An object with a state, and a set of state transition rules. What would it typically be ?
class C:

    def __init__(self):
        self._state = "A"

    def _switch(self, to):
        self._state = to

    def _state_A(self):
        print("A->B")
        self._switch("B")

    def _state_B(self):
        print("STOP")
        self._switch(None)

    def simulate(self):
        while self._state is not None:
            if self._state == "A":
                self._state_A()
            elif self._state == "B":
                self._state_B()

C().simulate() # prints A->B STOP
This is a quickly thrown together sample, so please don't be too picky. The problem with it, which I will try to eliminate, is this - you have two ways of representing the same thing, the state. What is the reason for aliasing _state_A by "A" and _state_B by "B" ? Oh, the last letter matches, I see... And what's the point in having the state-by-state switch in simulate ? Why don't we just call a spade a spade ?
class C:

    def __init__(self):
        self._state = self._state_A

    def _switch(self, to):
        self._state = to

    def _state_A(self):
        print("A->B")
        self._switch(self._state_B)

    def _state_B(self):
        print("STOP")
        self._switch(None)

    def simulate(self):
        while self._state is not None:
            self._state()

C().simulate()
In this second example, I don't have any arbitrary aliases for a state; instead I use its own handler as the state. A method which handles a state is the state here. It simplifies things just a bit - the switch is gone, and it is overall cleaner and more consistent to my taste.

Well, that's about what I had to say.

Python being a high level language... Other factors, such as the wide variety of built-in container types and the huge standard library, also help Python to be higher level than many other languages, but that's another story.

To be continued...

August 20, 2008

XML is like plankton in the information ocean

as huge amounts of it float around to be consumed by everyone.

August 17, 2008

Bosons, my ass

Higgs boson, they say, is the reason for wasting gazillions of euros on a high-tech circular tunnel.

So how come we still use portable energy sources that date back to 1800 and are only capable of giving 3000 mAh of charge ? How come we can't purposely transfer a significant amount of energy wirelessly, through the air, without having to wear a radiation-proof suit ? Speaking of which, why is radiation protection still 10m of lead ? Kind of limits space travel you know.

Higgs boson, when you discover it, you know what to do with it.