Jekyll2020-09-21T22:29:07+02:00http://slawomir.net//slawomir.netI'm a Linux Hooligan and this blog is about funny, interesting and weird faces of Software Engineering.Heirloom on journal cover2020-09-21T00:00:00+02:002020-09-21T00:00:00+02:00http://slawomir.net/2020/09/21/cover-heirloom<p><a href="/assets/Programista_90.jpg">
<img src="/assets/Programista_90.jpg" style="height: 250px; float: left; margin-right: 1em;" />
</a></p>
<p>A couple of months ago I wrote an article for the “Programista” journal (a Polish one) about how the DEFLATE
algorithm works under the bonnet. Apart from describing DEFLATE, it illustrates a clever use of
<a href="https://github.com/pauldmccarthy/indexed_gzip">indexed_gzip</a> to decompress a random part of
a gzipped file without decompressing what comes before it. The article made it to the cover, and I thought it was
a good chance to hide some information there. I decided to put an heirloom for my children on it. I wonder
what their reaction will be when they’re in their 20s and somebody tells them :)</p>Automating full-page web screenshots without ads and other crap2020-09-21T00:00:00+02:002020-09-21T00:00:00+02:00http://slawomir.net/2020/09/21/command-line-page-screenshots<p>Have you ever hit a wall with your idea/project? I recall that the other day I heard the following words
about my project.</p>
<blockquote>
<p>we won’t invest, because we aren’t sure if that boat will become a ship in the future</p>
</blockquote>
<p>Cruel words, right? Well, later on it emerged that their judgement was right. Not every boat
becomes a ship; just a few do. One of the most exceptional examples is the Internet. It was a boat
and it became a ship. Comparing it to a ship is actually not fair, but you get the idea.</p>
<p>When a boat becomes a ship it can accommodate many more people, it requires more power to operate,
it looks much better, it is more robust and more powerful, it offers more services, it’s less
maneuverable, etc. All of this is true for the Internet too. The target audience is much broader and
therefore the goals become different. Revenue streams are different too. Last but not least, technologies
and solutions from the past are not suitable anymore, and that creates new technical challenges for
the people who work with them.</p>
<p>In this post I want to describe how and why I automated taking full-page screenshots of web pages
without advertisements, GDPR notifications, cookie/privacy alerts, etc. This used to be a pretty easy
thing to do. That, unfortunately, doesn’t hold anymore.</p>
<p><img src="/assets/webscreenshot.png" alt="webscreenshot" /></p>
<hr />
<p>Information on the Internet has always been ephemeral. What if you find some information on a web page
and store the URL for the future? You may go back after a while only to discover it’s already gone (404). It may
have been moved somewhere else, or maybe it’s simply not available anymore. Yes, you can use a search engine
to find another source of information on the topic, but only if the URL contains some keywords,
or you were cautious enough to copy the page title along with the URL. I created a small system that makes
full-page screenshots of pages to circumvent this whole problem.</p>
<p>In the past you could simply use <code class="highlighter-rouge">curl</code> to create your own copy of a page, but unfortunately nowadays it
is not that straightforward:</p>
<ul>
<li>a lot of pages require JavaScript, and a significant part of them require JavaScript just to load the content</li>
<li>some pages are protected (e.g. against DDoS) and will check the browser</li>
<li>there is a lot of bloat (GDPR, cookie and privacy policies, full-page advertisements)</li>
<li>pages are resource-heavy</li>
<li>pages require a lot of content that comes from 3rd-party services (CDNs etc.)</li>
</ul>
<p>You can use a service like <a href="https://archive.is">archive.is</a>, but for the sake of this article I’m gonna
assume you want your own local copy. There are many ways to make a copy of a website, but
I’m going to focus on the most primitive one: full-page screenshots. I find screenshots easy to
preview and easy to share.</p>
<p>So what are the challenges of automating web page screenshots?</p>
<ol>
<li>in order for everything to work right we need an underlying browser to render the page for us</li>
<li>we need to take a full-page screenshot, so simple screen grabbing won’t work</li>
<li>page contents are fetched asynchronously, so we need to somehow instrument the browser to take the
screenshot only after the page is fully loaded</li>
<li>we need to hide all GDPR, ToS and cookie windows before taking a screenshot</li>
<li>since we want an automated solution, we look for a headless one (no running X’es)</li>
</ol>
<p>Points 1, 2, 3 and 5 are solved by using a headless automated browser like PhantomJS. We will use
the <a href="https://github.com/maaaaz/webscreenshot">webscreenshot</a> Python package as a wrapper. It contains
a convenient script to take full-page screenshots.</p>
<p>To solve point 4, we will use <a href="https://github.com/epitron/mitm-adblock">mitm-adblock</a>, which uses
<a href="https://mitmproxy.org/">mitmproxy</a> under the bonnet. Basically it forms an HTTP(S) proxy that will
reject JavaScript scripts according to Adblock rules. These are the same rules that are used by
browser extensions like Adblock, uBlock etc.</p>
<p>The picture above illustrates how the system works. After cloning <code class="highlighter-rouge">mitm-adblock</code> we <code class="highlighter-rouge">cd</code> into its
directory. When running it for the first time, we should execute <code class="highlighter-rouge">update-blocklists</code> to update the Adblock
rules. Then we execute the <code class="highlighter-rouge">go</code> script in the background (or in the foreground in another terminal).</p>
<p>The second step is to pull <code class="highlighter-rouge">webscreenshot</code> and <code class="highlighter-rouge">cd</code> into its directory. Assuming that a list of URLs is
prepared and available in the file <code class="highlighter-rouge">/tmp/links.txt</code>, we do the following:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for </span><span class="nb">link </span><span class="k">in</span> <span class="si">$(</span><span class="nb">cat</span> /tmp/links.txt<span class="si">)</span><span class="p">;</span> <span class="k">do </span>python3 webscreenshot.py <span class="nt">-P</span> <span class="s1">'http://localhost:8118'</span> <span class="s2">"</span><span class="nv">$link</span><span class="s2">"</span><span class="p">;</span> <span class="k">done</span>
</code></pre></div></div>
<p>Here <code class="highlighter-rouge">localhost:8118</code> is the endpoint of our <code class="highlighter-rouge">mitmproxy</code>. Depending on how many links we have, it may
take some time. When it finishes, we should have all of our screenshots available in the <code class="highlighter-rouge">screenshots</code>
subdirectory.</p>
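<p>If the link list may contain characters that trip up the shell loop above, an equivalent driver can be written in Python. This is only a sketch under the same assumptions as above (a <code class="highlighter-rouge">webscreenshot.py</code> in the current directory and the proxy on <code class="highlighter-rouge">localhost:8118</code>); it merely builds the commands, so they can be inspected or parallelized before running:</p>

```python
import subprocess
from pathlib import Path

def screenshot_commands(links_file, proxy="http://localhost:8118"):
    """Build one webscreenshot invocation per URL listed in links_file."""
    commands = []
    for link in Path(links_file).read_text().split():
        commands.append(["python3", "webscreenshot.py", "-P", proxy, link])
    return commands

# To actually run them:
# for cmd in screenshot_commands("/tmp/links.txt"):
#     subprocess.run(cmd, check=True)
```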
<p>And that’s all!</p>
<h3 id="caveats-and-further-steps">Caveats and further steps</h3>
<p>The described solution is definitely not complete, but it was enough for my pet project. Problems that
I’ve encountered include:</p>
<ul>
<li>some pages require logging in. This could be solved in numerous ways: e.g. hooking into
<code class="highlighter-rouge">webscreenshot.js</code>, custom logic in <code class="highlighter-rouge">mitmproxy</code>, or injecting appropriate cookies.</li>
<li>Adblock rules don’t cover everything. I had to hack <code class="highlighter-rouge">mitm-adblock</code> a little to block e.g. the <code class="highlighter-rouge">optad360</code>
and <code class="highlighter-rouge">statsforads</code> sites</li>
<li>some sites, e.g. Twitter, don’t work well with this solution. This can be solved by putting
custom code in <code class="highlighter-rouge">webscreenshot.js</code>, though</li>
</ul>
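<p>The extra blocking can be as simple as a host check layered on top of the Adblock rules. This is a hypothetical helper in the spirit of that hack, not mitm-adblock’s actual API; the domains are written with a guessed <code class="highlighter-rouge">.com</code> TLD purely for illustration:</p>

```python
from urllib.parse import urlparse

# Extra domains to drop unconditionally, on top of the Adblock lists.
EXTRA_BLOCKED = {"optad360.com", "statsforads.com"}

def should_block(url, extra=EXTRA_BLOCKED):
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == domain or host.endswith("." + domain)
               for domain in extra)
```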
<p>There are also some further enhancements I can see:</p>
<ul>
<li>scraping page HTML, so one can do a quick <code class="highlighter-rouge">grep</code> to find information</li>
<li>using <em>tesseract</em> or another OCR on page screenshots to extract the visible text instead of raw HTML</li>
<li>cropping page screenshots to exclude meaningless whitespace, to save disk space (my screenshot dir is about 700MB)</li>
</ul>babla: command line translation tool (Polish-English)2020-09-21T00:00:00+02:002020-09-21T00:00:00+02:00http://slawomir.net/2020/09/21/babla<p>This is gonna be pretty short. Some time ago I created a small script that uses the <a href="https://bab.la">bab.la</a>
web service to translate words between Polish and English. Some people from my team are already using
it and find it convenient.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pip3 <span class="nb">install </span>babla
<span class="nv">$ </span>babla sumienny
conscientious
dutiful
assiduous
faithful
</code></pre></div></div>Automating vimdiff’s HTML diff (TOhtml)2020-08-20T00:00:00+02:002020-08-20T00:00:00+02:00http://slawomir.net/2020/08/20/automating-vimdiff-tohtml<p>Proofs of concept are often done pretty differently than other kinds of work. The rules described in
the (in)famous <a href="https://gist.github.com/rondy/af1dee1d28c02e9a225ae55da2674a6f">“Effective engineer”</a> are even more important to succeed with PoCs. At the end of the day
it all gravitates around “value added divided by effort spent”.</p>
<p>I was doing a quick PoC project that lets the user paste a log fragment and see which other log
fragments in the database are most similar to the pasted one. But apart from seeing the list, the
user wants to inspect the differences too.</p>
<p>The simplest solution is to generate a diff, e.g. in HTML, and present it to the user. But what is the best
way to achieve that quickly? By using an existing tool!</p>
<p>This post is about automating <code class="highlighter-rouge">vimdiff</code>’s <code class="highlighter-rouge">TOhtml</code> using the <code class="highlighter-rouge">headlessvim</code> library. Apart from
presenting the actual solution, I cover some of the details of how Vim loads configuration files, how
we can pass commands to it, etc.</p>
<p><img src="/assets/headlessvim.png" alt="vimdiff in action" /></p>
<hr />
<p>The problem that my PoC solves is Nokia-specific. No worries, though. We can use Linux kernel
logs to explain what is going on.</p>
<p>Imagine there’s a bug in the kernel you’re using. You inspect <code class="highlighter-rouge">dmesg</code> and see some multi-line crash.
Now, to make it hard, there’s no Google/DuckDuckGo. But you have access to gazillions of logs from
executions from other machines with a description of the solution attached. What if you could search
for similar crashes in these logs and check if the root cause is the same?</p>
<p>Describing my algorithm that finds the most similar log fragments is out of the scope of this post. I’m just
gonna describe the last step, done by the user: inspecting the diff between the queried log fragment and
the match that has been found.</p>
<h2 id="subprocess">Subprocess</h2>
<p>The standard way to compare two files in <code class="highlighter-rouge">vimdiff</code> is:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vimdiff fileA fileB
</code></pre></div></div>
<p>and that is equivalent to:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vim <span class="nt">-d</span> fileA fileB
</code></pre></div></div>
<p>Now, the problem is that we need to somehow instruct Vim to automatically execute some commands
right after loading the files. Fortunately there’s the <code class="highlighter-rouge">-c</code> switch:</p>
<blockquote>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> -c {command}
{command} will be executed after the first file has been read. {command} is interpreted as an Ex
command. If the {command} contains spaces it must be enclosed in double quotes (this depends on
the shell that is used). Example: Vim "+set si" main.c
Note: You can use up to 10 "+" or "-c" commands.
</code></pre></div> </div>
</blockquote>
<p>So it becomes:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vim <span class="nt">-d</span> fileA fileB <span class="nt">-c</span> <span class="s1">'TOhtml'</span> <span class="nt">-c</span> <span class="s1">'sav! output.html'</span> <span class="nt">-c</span> <span class="s1">'qall!'</span>
</code></pre></div></div>
<p>We also want it to work on “vanilla Vim”, i.e. without plugins and rc files. Additionally, it would
be bad if the invocation had side effects on the disk. We achieve these goals by adding
<code class="highlighter-rouge">-i NONE -n -N -u vimrc</code>, where:</p>
<ul>
<li><code class="highlighter-rouge">-i NONE</code> - disables writing the viminfo file</li>
<li><code class="highlighter-rouge">-n</code> - don’t use swap files. Thanks to this we can execute parallel <code class="highlighter-rouge">vim</code> processes without the risk
of seeing the recovery message</li>
<li><code class="highlighter-rouge">-N</code> - disables <code class="highlighter-rouge">vi</code> compatibility</li>
<li><code class="highlighter-rouge">-u vimrc</code> - instructs <code class="highlighter-rouge">vim</code> to load its configuration from the <code class="highlighter-rouge">vimrc</code> file in the current working directory</li>
</ul>
<p>We need to create a <code class="highlighter-rouge">vimrc</code> file for this to work. If we used <code class="highlighter-rouge">NONE</code> as the argument to <code class="highlighter-rouge">-u</code>, the
<code class="highlighter-rouge">TOhtml</code> command wouldn’t work. What we need to do instead is mimic how Vim is initialized in
our distro.</p>
<p>I’m working on Debian, so my vimrc file consists of just a single line:</p>
<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code>runtime<span class="p">!</span> debian<span class="p">.</span><span class="k">vim</span>
</code></pre></div></div>
<p>This <code class="highlighter-rouge">debian.vim</code> file comes from the runtime path, which is set to <code class="highlighter-rouge">/usr/share/vim/vim82/</code> on my
box. I’m not a magician: I copied that line from <code class="highlighter-rouge">/etc/vim/vimrc</code>.</p>
<p>The final version is:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vim <span class="nt">-i</span> NONE <span class="nt">-n</span> <span class="nt">-N</span> <span class="nt">-u</span> vimrc <span class="nt">-d</span> fileA fileB <span class="nt">-c</span> <span class="s1">'TOhtml'</span> <span class="nt">-c</span> <span class="s1">'sav! output.html'</span> <span class="nt">-c</span> <span class="s1">'qall!'</span>
</code></pre></div></div>
<p>And voilà! You can now open the <code class="highlighter-rouge">output.html</code> file and see what was generated. Upon running the
command you’ll see the <code class="highlighter-rouge">vimdiff</code> interface for a fraction of a second.</p>
<p>Putting this into a Python script is pretty straightforward:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">subprocess</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">subprocess</span><span class="o">.</span><span class="n">Popen</span><span class="p">([</span>
<span class="s">"/usr/bin/vim"</span><span class="p">,</span> <span class="s">"-i"</span><span class="p">,</span> <span class="s">"NONE"</span><span class="p">,</span> <span class="s">"-n"</span><span class="p">,</span> <span class="s">"-N"</span><span class="p">,</span> <span class="s">"-u"</span><span class="p">,</span> <span class="s">"vimrc"</span><span class="p">,</span> <span class="s">"-d"</span><span class="p">,</span> <span class="s">"fileA"</span><span class="p">,</span> <span class="s">"fileB"</span><span class="p">,</span> <span class="s">"-c"</span><span class="p">,</span> <span class="s">"TOhtml"</span><span class="p">,</span>
<span class="s">"-c"</span><span class="p">,</span> <span class="s">"sav! output.html"</span><span class="p">,</span> <span class="s">"-c"</span><span class="p">,</span> <span class="s">"qall!"</span>
<span class="p">])</span>
<span class="n">p</span><span class="o">.</span><span class="n">communicate</span><span class="p">()</span>
</code></pre></div></div>
<p>But this solution is bad, because:</p>
<ul>
<li>it’s fragile as hell</li>
<li>it explicitly uses <code class="highlighter-rouge">subprocess</code>, so it is literally asking for a thin wrapper</li>
<li>it requires a TTY and takes it over. Watching the server logs looks really funny :)</li>
</ul>
<h2 id="headlessvim">headlessvim</h2>
<p>Fortunately there’s a library that uses <code class="highlighter-rouge">pyte</code> (a fake terminal implemented in Python) and wraps
the Vim process in a handy API. Just look at how much better the code is when we use that library:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">headlessvim</span>
<span class="k">with</span> <span class="n">headlessvim</span><span class="o">.</span><span class="nb">open</span><span class="p">(</span><span class="n">args</span><span class="o">=</span><span class="n">f</span><span class="s">"-N -i NONE -n -u ./vimrc -d {fileA_path} {fileB_path}"</span><span class="p">)</span> <span class="k">as</span> <span class="n">vim</span><span class="p">:</span>
<span class="n">vim</span><span class="o">.</span><span class="n">command</span><span class="p">(</span><span class="s">"TOhtml"</span><span class="p">)</span>
<span class="n">vim</span><span class="o">.</span><span class="n">command</span><span class="p">(</span><span class="n">f</span><span class="s">"sav! {output_path}"</span><span class="p">)</span>
</code></pre></div></div>
<p>The library was initially developed to help write unit tests for Vim plugins, but it can be used in
many more scenarios, including ours.</p>Accessing globals after wrong code.interact() call2020-06-26T00:00:00+02:002020-06-26T00:00:00+02:00http://slawomir.net/2020/06/26/python-interact-no-globals<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
<span class="k">global</span> <span class="n">database_index</span>
<span class="n">bar</span><span class="p">()</span>
<span class="kn">import</span> <span class="nn">code</span><span class="p">;</span> <span class="n">code</span><span class="o">.</span><span class="n">interact</span><span class="p">(</span><span class="n">local</span><span class="o">=</span><span class="nb">locals</span><span class="p">())</span></code></pre></figure>
<p>Have you ever called <code class="highlighter-rouge">code.interact()</code> and forgotten to pass <code class="highlighter-rouge">local=locals()</code> or
<code class="highlighter-rouge">local={**globals(), **locals()}</code>? Most of the time you can just exit the interactive console, add the
missing parameter and run the program again. But what if the program had been executing for a couple of hours
before the interactive console started? You might want to access e.g. global variables without
running it again. Fortunately Python is a language for adults, so it’s totally doable.</p>
<hr />
<p>The first option to access all the variables is to use <code class="highlighter-rouge">sys._getframe</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">sys</span>
<span class="o">>>></span> <span class="n">frame</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">_getframe</span><span class="p">(</span><span class="mi">6</span><span class="p">)</span> <span class="c1"># in my setup it happens to be 6th frame
</span><span class="o">>>></span> <span class="n">database_index</span> <span class="o">=</span> <span class="n">frame</span><span class="o">.</span><span class="n">f_globals</span><span class="p">[</span><span class="s">'database_index'</span><span class="p">]</span>
</code></pre></div></div>
<p>But in the help of <code class="highlighter-rouge">_getframe</code> we can read:</p>
<blockquote>
<p>This function should be used for internal and specialized purposes only.</p>
</blockquote>
<p>Fortunately there’s another module that also does what we want: <code class="highlighter-rouge">inspect</code>. It’s a little bit
more verbose, but it is not private.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">inspect</span>
<span class="o">>>></span> <span class="n">database_index</span> <span class="o">=</span> <span class="n">inspect</span><span class="o">.</span><span class="n">stack</span><span class="p">()[</span><span class="mi">6</span><span class="p">]</span><span class="o">.</span><span class="n">frame</span><span class="o">.</span><span class="n">f_globals</span><span class="p">[</span><span class="s">'database_index'</span><span class="p">]</span>
</code></pre></div></div>
<p>But frankly speaking, if you look at <code class="highlighter-rouge">inspect</code>’s code you’ll discover that it uses <code class="highlighter-rouge">sys._getframe</code> under the
bonnet. So my suggestion is to:</p>
<ul>
<li>use <code class="highlighter-rouge">sys._getframe</code> in emergency situations (like in interactive console)</li>
<li>use <code class="highlighter-rouge">inspect</code> module if you are doing this in a script</li>
</ul>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">stack</span><span class="p">(</span><span class="n">context</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
<span class="s">"""Return a list of records for the stack above the caller's frame."""</span>
<span class="k">return</span> <span class="n">getouterframes</span><span class="p">(</span><span class="n">sys</span><span class="o">.</span><span class="n">_getframe</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">context</span><span class="p">)</span></code></pre></figure>
<p>Obviously, instead of <code class="highlighter-rouge">code.interact</code> an alternative can be used: <code class="highlighter-rouge">pdb.set_trace</code>. It doesn’t
suffer from such problems at all.</p>
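<p>For completeness, the <code class="highlighter-rouge">pdb</code> route looks like this (a sketch; inside the debugger you can walk frames with <code class="highlighter-rouge">up</code>/<code class="highlighter-rouge">down</code>, so the globals of any frame are reachable):</p>

```python
import pdb

def long_running_job():
    partial_result = 42  # stand-in for hours of computation
    pdb.set_trace()      # debugger prompt with full frame access:
                         #   `p database_index`, `up`, `interact`, ...
    return partial_result
```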
<h3 id="bonus-python-frames-and-surprising-setter-of-f_lineno">Bonus: python frames and surprising setter of <code class="highlighter-rouge">f_lineno</code></h3>
<p>Out of curiosity I looked into the <em>CPython</em> sources. It looks like the <code class="highlighter-rouge">_getframe</code> function does a simple O(n)
stack traversal. It retrieves frames from the current thread state.</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="k">static</span> <span class="n">PyObject</span> <span class="o">*</span> <span class="nf">sys__getframe_impl</span><span class="p">(</span><span class="n">PyObject</span> <span class="o">*</span><span class="n">module</span><span class="p">,</span> <span class="kt">int</span> <span class="n">depth</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">PyFrameObject</span> <span class="o">*</span><span class="n">f</span> <span class="o">=</span> <span class="n">_PyThreadState_GET</span><span class="p">()</span><span class="o">-></span><span class="n">frame</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">depth</span> <span class="o">></span> <span class="mi">0</span> <span class="o">&&</span> <span class="n">f</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">f</span><span class="o">-></span><span class="n">f_back</span><span class="p">;</span>
<span class="o">--</span><span class="n">depth</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">...</span>
<span class="n">Py_INCREF</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
<span class="k">return</span> <span class="p">(</span><span class="n">PyObject</span><span class="o">*</span><span class="p">)</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p><code class="highlighter-rouge">_PyThreadState_GET</code>, as the name suggests, returns the thread state object, where <code class="highlighter-rouge">frame</code> is one of the most
important fields. A quick look at the definition of the frame struct reveals what can potentially be done
with it: <code class="highlighter-rouge">f_back</code>, <code class="highlighter-rouge">f_code</code>, <code class="highlighter-rouge">f_globals</code>, <code class="highlighter-rouge">f_locals</code>, <code class="highlighter-rouge">f_lineno</code>, etc. My inner hacker woke up and
I tried to change <code class="highlighter-rouge">f_lineno</code> of a frame to see what happens:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="n">sys</span><span class="o">.</span><span class="n">_getframe</span><span class="p">()</span><span class="o">.</span><span class="n">f_lineno</span> <span class="o">=</span> <span class="mi">1</span>
<span class="nb">ValueError</span><span class="p">:</span> <span class="n">f_lineno</span> <span class="n">can</span> <span class="n">only</span> <span class="n">be</span> <span class="nb">set</span> <span class="n">by</span> <span class="n">a</span> <span class="n">trace</span> <span class="n">function</span></code></pre></figure>
<p>This error is baked right into <em>CPython</em>! Apparently the <code class="highlighter-rouge">frame_setlineno</code> function bails out when
the caller is not a trace function. From the function’s docs we can also learn that <code class="highlighter-rouge">f_lineno</code>
is used by the tracing mechanism. They also describe some cases where you cannot jump:</p>
<ul>
<li>Lines with an ‘except’ statement on them can’t be jumped to, because
they expect an exception to be on the top of the stack.</li>
<li>Lines that live in a ‘finally’ block can’t be jumped from or to, since
the END_FINALLY expects to clean up the stack after the ‘try’ block.</li>
<li>‘try’, ‘with’ and ‘async with’ blocks can’t be jumped into because
the blockstack needs to be set up before their code runs.</li>
<li>‘for’ and ‘async for’ loops can’t be jumped into because the
iterator needs to be on the stack.</li>
<li>Jumps cannot be made from within a trace function invoked with a
‘return’ or ‘exception’ event since the eval loop has been exited at
that time.</li>
</ul>
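<p>The flip side of that error message is that a trace function <em>is</em> allowed to set <code class="highlighter-rouge">f_lineno</code> — this is how <code class="highlighter-rouge">pdb</code>’s <code class="highlighter-rouge">jump</code> command works. A minimal sketch (line numbers are computed from <code class="highlighter-rouge">co_firstlineno</code>, so it stays self-contained):</p>

```python
import sys

def demo():
    x = 1
    x = 999        # the tracer below jumps over this line
    return x

SKIP = demo.__code__.co_firstlineno + 2   # line of "x = 999"

def tracer(frame, event, arg):
    if event == "line" and frame.f_lineno == SKIP:
        frame.f_lineno = SKIP + 1         # legal only inside a trace function
    return tracer

sys.settrace(tracer)
try:
    result = demo()   # the assignment of 999 never runs
finally:
    sys.settrace(None)
```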
<p>I can only say that whatever detail you pick, it becomes a rabbit hole. This is so beautiful.</p>
<p align="center">
<img src="/assets/recursion.gif" />
</p>tee equivalent as a Python class2020-06-25T00:00:00+02:002020-06-25T00:00:00+02:00http://slawomir.net/2020/06/25/python-tee<p>Do you know the <code class="highlighter-rouge">tee</code> program? Its <code class="highlighter-rouge">man</code> page reads:</p>
<blockquote>
<p>tee - read from standard input and write to <strong>standard output and files</strong></p>
</blockquote>
<p>It makes it easy to split the output of one program into both <em>stdout</em> and files. It’s a nice UNIX
tool. Recently I was doing a code review and it turned out that an equivalent of such a thing may be pretty
useful in Python programs too:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"file1.txt"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f1</span><span class="p">,</span> <span class="n">tee</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="s">"file2.txt"</span><span class="p">))</span> <span class="k">as</span> <span class="n">f2</span><span class="p">:</span>
<span class="n">shutil</span><span class="o">.</span><span class="n">copyfileobj</span><span class="p">(</span><span class="n">f1</span><span class="p">,</span> <span class="n">f2</span><span class="p">)</span>
<span class="k">if</span> <span class="n">f2</span><span class="o">.</span><span class="n">tail</span> <span class="ow">not</span> <span class="ow">in</span> <span class="p">(</span><span class="s">'</span><span class="se">\r</span><span class="s">'</span><span class="p">,</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">):</span>
<span class="n">f2</span><span class="o">.</span><span class="n">fileobj</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span></code></pre></figure>
<p>It allows us to do extra work along the way, so we can employ it for e.g. simultaneous hash calculation or other jobs.</p>
<hr />
<p>I came up with this idea whilst reviewing some code. I saw the following function (anonymized).</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"> <span class="k">def</span> <span class="nf">_some_private_method</span><span class="p">(</span><span class="n">cls</span><span class="p">,</span> <span class="n">paths</span><span class="p">:</span> <span class="n">Iterable</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
<span class="n">special_paths</span> <span class="o">=</span> <span class="nb">filter</span><span class="p">(</span><span class="n">is_special_path</span><span class="p">,</span> <span class="n">paths</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">FILEPATH</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">out_file</span><span class="p">:</span>
<span class="k">for</span> <span class="n">path</span> <span class="ow">in</span> <span class="n">special_paths</span><span class="p">:</span>
<span class="n">LOGGER</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="n">f</span><span class="s">"Adding {path}"</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">additional_file</span><span class="p">:</span>
<span class="n">shutil</span><span class="o">.</span><span class="n">copyfileobj</span><span class="p">(</span><span class="n">additional_file</span><span class="p">,</span> <span class="n">out_file</span><span class="p">)</span>
<span class="n">additional_file</span><span class="o">.</span><span class="n">seek</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">last_byte</span> <span class="o">=</span> <span class="n">additional_file</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="k">if</span> <span class="n">last_byte</span> <span class="o">!=</span> <span class="n">b</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="ow">and</span> <span class="n">last_byte</span> <span class="o">!=</span> <span class="n">b</span><span class="s">"</span><span class="se">\r</span><span class="s">"</span><span class="p">:</span>
<span class="n">out_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">b</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span></code></pre></figure>
<p>I don’t like such code. The <code class="highlighter-rouge">seek</code> hack is obscure. What can be done to make it better? What if we
simply remembered the last byte copied by <code class="highlighter-rouge">shutil.copyfileobj</code>?</p>
<p>Unfortunately, <code class="highlighter-rouge">copyfileobj</code> accepts only two <em>fileobj</em>s and a buffer size. Recently I was experimenting
with <code class="highlighter-rouge">indexed_gzip</code> and I had to roll my own copy of <code class="highlighter-rouge">copyfileobj</code> that, apart from copying the
data, also calculated an md5 hash and the number of bytes copied.</p>
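<p>Such a copy helper is easy to hand-roll. Below is an illustrative sketch, not the actual <code class="highlighter-rouge">indexed_gzip</code> code; the name <code class="highlighter-rouge">copy_with_md5</code> is made up:</p>

```python
import hashlib
import io

def copy_with_md5(src, dst, length=64 * 1024):
    """Copy src to dst in chunks, returning (bytes_copied, md5_hexdigest)."""
    md5 = hashlib.md5()
    nbytes = 0
    while True:
        chunk = src.read(length)
        if not chunk:
            break
        md5.update(chunk)      # hash the data as it passes through
        nbytes += len(chunk)
        dst.write(chunk)
    return nbytes, md5.hexdigest()

src = io.BytesIO(b"hello world")
dst = io.BytesIO()
nbytes, digest = copy_with_md5(src, dst)
print(nbytes)  # 11
```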
<p>An alternative is to wrap one of the arguments with something that will do whatever we want. Let’s
focus on the problem at hand: adding a newline if necessary.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">MyTee</span><span class="p">:</span>
<span class="n">fileobj</span><span class="p">:</span> <span class="n">io</span><span class="o">.</span><span class="n">BufferedReader</span>
<span class="n">tail</span><span class="p">:</span> <span class="nb">bytes</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">init</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">write</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">tail</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">:]</span> <span class="c1"># without a colon it would not become bytes()
</span> <span class="bp">self</span><span class="o">.</span><span class="n">fileobj</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">data</span><span class="p">)</span></code></pre></figure>
<p>If we need to remember the last <em>k</em> bytes, we can simply use a <code class="highlighter-rouge">collections.deque</code> (with <code class="highlighter-rouge">maxlen=k</code>) as <code class="highlighter-rouge">tail</code> and it will
work as a circular buffer.</p>
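<p>For illustration, here is how a deque with <code class="highlighter-rouge">maxlen</code> behaves as a circular buffer over the last three bytes:</p>

```python
from collections import deque

# A deque with maxlen acts as a circular buffer:
# old bytes fall off the left as new ones are appended.
tail = deque(maxlen=3)
for chunk in (b"hello", b" wor", b"ld\n"):
    tail.extend(chunk)

print(bytes(tail))  # b'ld\n'
```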
<p>In order to make it look like the first listing, we need to add a trivial context manager:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">@</span><span class="n">contextlib</span><span class="o">.</span><span class="n">contextmanager</span>
<span class="k">def</span> <span class="nf">tee</span><span class="p">(</span><span class="n">fileobj</span><span class="p">):</span>
<span class="k">yield</span> <span class="n">MyTee</span><span class="p">(</span><span class="n">fileobj</span><span class="p">)</span></code></pre></figure>
<p>And voilà!</p>
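<p>Putting the pieces together, here is a minimal runnable sketch of the whole mechanism. It works on binary streams (<code class="highlighter-rouge">io.BytesIO</code> stands in for real files), and <code class="highlighter-rouge">tail</code> gets a default value so it exists before the first write:</p>

```python
import contextlib
import io
import shutil
from dataclasses import dataclass, field

@dataclass
class MyTee:
    fileobj: io.IOBase
    tail: bytes = field(default=b"", init=False)  # default so tail exists before any write

    def write(self, data):
        if data:
            self.tail = data[-1:]  # slice, not index, so it stays bytes
        return self.fileobj.write(data)

@contextlib.contextmanager
def tee(fileobj):
    yield MyTee(fileobj)

src = io.BytesIO(b"no trailing newline")
dst = io.BytesIO()
with tee(dst) as t:
    shutil.copyfileobj(src, t)
    if t.tail not in (b"\r", b"\n"):
        t.fileobj.write(b"\n")  # fix up the missing newline

print(dst.getvalue())  # b'no trailing newline\n'
```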
<p>This mechanism can be improved further to make it even more flexible.</p>Roccat Suora driver for Linux > 4.11.02020-06-24T00:00:00+02:002020-06-24T00:00:00+02:00http://slawomir.net/2020/06/24/roccat-suora-linux<p>Recently I decided it was time to install a dedicated driver for my keyboard, to programmatically control
its LED behavior. I have a Roccat Suora keyboard. Fortunately, all of the code is already available
<a href="https://sourceforge.net/projects/roccat/files/">here</a>. However, the kernel module failed to compile
because <code class="highlighter-rouge">signal_pending</code> was undeclared. I had to add the following code in <code class="highlighter-rouge">hid-roccat.c</code> and
it worked like a charm.</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#include <linux/version.h>
#if (LINUX_VERSION_CODE >= KERNEL_VERSION(4, 11, 0))
</span> <span class="cp">#include <linux/sched/signal.h>
#endif</span></code></pre></figure>
<p><code class="highlighter-rouge">signal_pending</code> was moved to <code class="highlighter-rouge">linux/sched/signal.h</code> in kernel 4.11.</p>
<hr />Explanation of C++ expression on Code::Dive T-Shirts2019-11-29T00:00:00+01:002019-11-29T00:00:00+01:00http://slawomir.net/2019/11/29/cpp-code-dive-t-shirts-expression<p><img src="/assets/code-dive-2019-t-shirt.jpg" alt="Code::Dive 2019 T-Shirt" /></p>
<p>This year the Code::Dive conference was held in Wrocław for the sixth time. It is amazing how all of this
has unfolded, especially given the fact that I was involved in it from its very beginning. In the last
two years I had too little time and too few ideas to give talks, but I managed to make a small contribution. The
task was to prepare an eye-catching slogan or something similar to put on T-Shirts for conference
attendees.</p>
<p>C++ roots are something I share with the conference, so I suggested we should put a fancy C++
expression on the T-Shirts:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(+[](){})();
</code></pre></div></div>
<p>For non-C++ folks it might look like a sequence of random characters, but it is a perfectly valid
C++ expression. For C++ programmers it isn’t even odd, except for the plus sign. Lots of attendees
questioned this syntax, but they kept coming back saying “indeed, it compiles! What the hell
is that?”.</p>
<p>This is what makes it somewhat special - a single annoying character that leaves you unable
to explain what’s going on. Frankly speaking, this isn’t something I came up with out of the blue.
I saw it many years ago, it caught my attention, and I recalled it when I was thinking about
the T-Shirt.</p>
<p>For the sake of this post, let’s see how we can decompose the expression to make it much simpler,
and what the + sign actually does.</p>
<p>In the middle of the expression we have an anonymous function, a.k.a. a lambda. The C++ lambda syntax feels like a
compromise between readability and expressiveness.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[] () {}
</code></pre></div></div>
<p>In order to write an anonymous function, the programmer has to specify three things. This is basic C++:</p>
<ul>
<li><code class="highlighter-rouge">[]</code> - what variables from the outer scope the function will capture (capture list)</li>
<li><code class="highlighter-rouge">()</code> - function parameters</li>
<li><code class="highlighter-rouge">{}</code> - function body</li>
</ul>
<p>In our case we’re not capturing any variables, we’re not getting any parameters and we’re doing
nothing, hence it becomes three empty blocks - <code class="highlighter-rouge">[](){}</code>.</p>
<p>Now we know what the core of the expression does. Let’s use the lambda symbol to denote it and see
how it simplifies the overall expression:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(+λ)();
</code></pre></div></div>
<p>We’re left with the plus sign, parentheses and a semicolon - nothing to worry about, except the plus.
Let’s explain the plus sign, or unary <code class="highlighter-rouge">operator+</code> to be precise. It isn’t simple without referring
to the standard. A good explanation was given in <a href="https://stackoverflow.com/questions/17822131/resolving-ambiguous-overload-on-function-pointer-and-stdfunction-for-a-lambda">this SO question</a>.</p>
<p>Generally speaking, the evaluation of a lambda expression depends on whether it captures or not. In our
case it doesn’t capture. Let’s see what the standard has to say about that:</p>
<blockquote>
<p>The closure type for a lambda-expression with no lambda-capture has a public non-virtual non-explicit const conversion function to pointer to function having the same parameter and return types as the closure type’s function call operator. The value returned by this conversion function shall be the address of a function that, when invoked, has the same effect as invoking the closure type’s function call operator.</p>
</blockquote>
<p>So when a compiler encounters a lambda with an empty capture list, it will create a class that has that
specific conversion function. In other words, there exists a method that converts such a lambda
to a pointer to function.</p>
<p>Now let’s see how unary <code class="highlighter-rouge">operator+</code> works:</p>
<blockquote>
<p>The operand of the unary + operator shall have arithmetic, unscoped enumeration, or pointer type and the result is the value of the argument.</p>
</blockquote>
<p>The wording may be unintuitive, but it boils down to:</p>
<pre><code class="language-C++">+1 == 1;
+var == var;
+ptr == ptr;
</code></pre>
<p>Hopefully it’s clear now :). For the record, I went even further in making the expression more
cryptic - e.g. <code class="highlighter-rouge">((void)(([](auto)->void{})(+[](){})));</code> - but… as the famous Zen of
Python states: Simple is better than complex.</p>
<p>This may come as a surprise to some of you, but C++ is not even close to other languages when it
comes to syntax oddities. Please take a look at Malbolge:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> (=<`#9]~6ZY32Vx/4Rs+0No-&Jk)"Fh}|Bcy?`=*z]Kw%oG4UUS0/@-ejc(:'8dc
</code></pre></div></div>savis - visualize SQLAlchemy models without fuss2019-11-09T00:00:00+01:002019-11-09T00:00:00+01:00http://slawomir.net/2019/11/09/savis<p><img src="/assets/savis-erd.png" alt="Simple ERD" /></p>
<p>Some time ago I realized that our project is a victim of the NoSQL hype (hey! <a href="https://en.wikipedia.org/wiki/Hype_cycle">hype cycle</a>).
It was actually my fault, since I introduced it. There was a specific motivation behind
that decision, but that’s something I would like to keep for a separate post.</p>
<p>A couple of days ago I started working on a plan to migrate to SQL. I extracted all of
the keys and their respective schemas from our NoSQL store and started doing ML. By ML I don’t
mean Machine Learning, but Manual Labor :-).</p>
<p>Our project grew surprisingly big in terms of the number of keys and the relations between them.
One option would be to rewrite everything from scratch. But I didn’t want to
do that. In my experience such big rewrites almost always backfire. I was looking at
how to split the keyspace so we could proceed in a more iterative way. I was experimenting
a lot, and what I missed was an easy way to write models and see the ERD (entity relationship
diagram).</p>
<p>Ideally what I was looking for would:</p>
<ul>
<li>let me control the data (yes, online ERD tools, I’m looking at you)</li>
<li>let me write model/entity definitions only once</li>
<li>let me create ER diagram without running database</li>
<li>let me test these models in action without any modifications</li>
</ul>
<p>None of the online and offline tools matched my requirements. The only project I found
was <a href="https://github.com/Alexis-benoist/eralchemy">eralchemy</a>, which is great, but you either have to run a database, or write the models
in a custom markdown format.</p>
<p>Maybe there is something I couldn’t find on the internet, who knows. But at that time
I decided to write a small pet project that would satisfy me. It’s called <code class="highlighter-rouge">savis</code> -
SqlAlchemy VISualizer. The tool is <a href="https://github.com/szborows/savis">available at GitHub</a>. The rest of the post describes
how it was built.</p>
<p>Eralchemy is definitely a great tool. It’s close to what I wanted, but in order to
obtain an ER diagram you need to either
<ol style="list-style-type: lower-alpha;">
<li>write your models, run migrations, extract schema</li>
<li>create a copy of your models in markdown notation</li>
</ol>
<p>We don’t want to run a database, because we might want to change our models rapidly
and see the impact immediately. This leaves us with option b only. But how do we maintain
only one definition of our models? It’s simple: we write a program that
reads Python source files, extracts all of the models and prints them out in the target
format.</p>
<p>This could be done by treating Python files as if they were plain text files,
but that wouldn’t be bulletproof. But hey! Python’s motto is “Batteries included”!
It comes with a library we can use to do this the right way - ast. We’re going to use
its parser to convert a textual file into an Abstract Syntax Tree. Then it’s all about using
the tree to find all of the classes, filtering out those which aren’t models, extracting
class members and producing the final output. Let’s see how we do all of this.</p>
<p>First, we need to look for candidate files. We can use globbing to do this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">file_</span> <span class="ow">in</span> <span class="n">input_dir</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s">'**/*.py'</span><span class="p">):</span>
<span class="n">process_file</span><span class="p">(</span><span class="n">file_</span><span class="p">)</span>
</code></pre></div></div>
<p>We’re using a construct called a <a href="http://www.tldp.org/LDP/abs/html/globbingref.html">glob</a>.
It looks like a regex, but it isn’t one. Whenever you use bash you can pass such a
glob expression - e.g. in order to recursively find files with the <code class="highlighter-rouge">.py</code> extension you
can use the following spell (note that in bash, <code class="highlighter-rouge">**</code> requires <code class="highlighter-rouge">shopt -s globstar</code>). I highly recommend reading the linked documentation, as even
seasoned software engineers aren’t aware of some of its features. See <code class="highlighter-rouge">man glob</code> for
further details.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">ls</span> <span class="k">**</span>/<span class="k">*</span>.py
</code></pre></div></div>
<p>Neat, isn’t it?
Python’s <code class="highlighter-rouge">glob</code> module (and <code class="highlighter-rouge">pathlib.Path.glob</code>, used above) does the same. Speaking of paths… I recommend the <code class="highlighter-rouge">pathlib</code>
library, which lets you work with paths like a boss. Representing paths with strings
can be cumbersome. Maybe in our case it would be like using a cannonball against a fly,
but knowing/using more libraries won’t hurt.</p>
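<p>As a quick illustration (using a throwaway temporary directory), <code class="highlighter-rouge">pathlib.Path.glob</code> handles the recursive pattern for us:</p>

```python
import tempfile
from pathlib import Path

# Build a throwaway directory tree to demonstrate recursive globbing.
root = Path(tempfile.mkdtemp())
(root / "pkg").mkdir()
(root / "pkg" / "models.py").write_text("# models live here")
(root / "app.py").write_text("# app entry point")
(root / "README.txt").write_text("not a python file")

# '**/*.py' matches .py files at any depth below root.
py_files = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.py"))
print(py_files)  # ['app.py', 'pkg/models.py']
```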
<p>Once we have a path to a file that may contain model definitions, we should look for
them! A model is a Python class that has an extra field: <code class="highlighter-rouge">__tablename__</code>. We will make
use of this requirement.</p>
<p>But how do we convert a file into something we can work on? How do we use the <code class="highlighter-rouge">ast</code> library?
As it turns out, it’s pretty simple:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">file_</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">root</span> <span class="o">=</span> <span class="n">ast</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">(),</span> <span class="n">file_</span><span class="p">)</span>
</code></pre></div></div>
<p>The <code class="highlighter-rouge">ast</code> library parses the file and returns the tree root. From now on we can continue
working on that tree. We have to traverse it recursively to find classes that
are indeed SQLAlchemy models. At some point we’ll want to iterate over the models
to generate their representation. <code class="highlighter-rouge">for model in get_models(tree)</code> looks like
the pythonic way, so our implementation should be a generator.</p>
<p>All nodes in the tree that aren’t classes should be omitted. Since each node
is of a specific type, we can filter nodes with an <code class="highlighter-rouge">isinstance</code> check. If the node isn’t
of type <code class="highlighter-rouge">ast.ClassDef</code>, we should recurse, because there might still be class
definitions deeper down. Consider the following example. This is, by the way, a good example of
why grep-like processing is a bad idea.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">(</span><span class="n">var</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
<span class="k">if</span> <span class="n">var</span><span class="p">:</span>
<span class="k">class</span> <span class="nc">Bar</span><span class="p">:</span>
<span class="k">pass</span>
<span class="k">return</span> <span class="n">Bar</span><span class="p">()</span>
</code></pre></div></div>
<p>The second ingredient is the <code class="highlighter-rouge">__tablename__</code> class member. If it’s present in the class
definition, then we’re talking about an SQLAlchemy model. Here’s the code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">find_models</span><span class="p">(</span><span class="n">node</span><span class="p">):</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">ast</span><span class="o">.</span><span class="n">ClassDef</span><span class="p">):</span>
<span class="k">if</span> <span class="s">'__tablename__'</span> <span class="ow">in</span> <span class="n">get_member_names</span><span class="p">(</span><span class="n">node</span><span class="p">):</span>
<span class="k">yield</span> <span class="n">node</span>
<span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="s">'body'</span><span class="p">):</span>
<span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">node</span><span class="o">.</span><span class="n">body</span><span class="p">:</span>
<span class="k">yield</span> <span class="k">from</span> <span class="n">find_models</span><span class="p">(</span><span class="n">child</span><span class="p">)</span>
</code></pre></div></div>
<p>Some of the nodes won’t have a <code class="highlighter-rouge">body</code> attribute, so we have to filter those out
too. I believe the code is straightforward, so let’s continue with how
to implement <code class="highlighter-rouge">get_member_names</code>. This function extracts the names of
all class members.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_member_names</span><span class="p">(</span><span class="n">node</span><span class="p">):</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="s">'body'</span><span class="p">):</span>
<span class="k">return</span>
<span class="k">for</span> <span class="n">member</span> <span class="ow">in</span> <span class="n">node</span><span class="o">.</span><span class="n">body</span><span class="p">:</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">member</span><span class="p">,</span> <span class="s">'targets'</span><span class="p">)</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">member</span><span class="o">.</span><span class="n">targets</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">for</span> <span class="n">target</span> <span class="ow">in</span> <span class="n">member</span><span class="o">.</span><span class="n">targets</span><span class="p">:</span>
<span class="k">yield</span> <span class="n">target</span><span class="o">.</span><span class="nb">id</span>
</code></pre></div></div>
<p>Again, we’re checking whether the node has a body. Then we’re iterating over
all children of that particular node. We’re looking for <code class="highlighter-rouge">ast.Assign</code> members
(e.g. <code class="highlighter-rouge">variable = value</code>), but there’s no need to check the type - we can directly
check for the <code class="highlighter-rouge">targets</code> attribute. Nodes that have it represent
assignments. A target is, simply speaking, a variable the value will be
written to. We iterate over the targets (Python supports chained assignments like
<code class="highlighter-rouge">a = b = 1</code>, where there is more than one target) and yield them. Voilà!</p>
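<p>To see both functions in action, here is a condensed, runnable variant of them (the sample models are made up, and <code class="highlighter-rouge">getattr</code> with a default stands in for the explicit <code class="highlighter-rouge">hasattr</code> checks):</p>

```python
import ast

# Hypothetical input: two classes, only one of which looks like a model.
SOURCE = """
class User:
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)

class Helper:
    pass
"""

def get_member_names(node):
    # Yield names assigned at class-body level (ast.Assign targets).
    for member in getattr(node, "body", []):
        for target in getattr(member, "targets", []):
            yield target.id

def find_models(node):
    # A class with a __tablename__ member is treated as a model.
    if isinstance(node, ast.ClassDef) and "__tablename__" in get_member_names(node):
        yield node
    for child in getattr(node, "body", []):
        yield from find_models(child)

tree = ast.parse(SOURCE)
print([model.name for model in find_models(tree)])  # ['User']
```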
<p>Frankly speaking, we’re almost done. We need to extract the types of the class
members, check which parameters are passed to them, etc. All fields that
represent SQL columns will be <code class="highlighter-rouge">sqlalchemy.Column</code> instances, <code class="highlighter-rouge">primary_key</code>
is the keyword argument used to denote a primary key, and so on. We just need to take all of
this into account. The full code is available <a href="https://github.com/szborows/savis/blob/master/savis.py">here</a>.</p>
<p>What does it finally give us? A markdown file that eralchemy can understand and
display. It’s not complicated. It’s not complete either. But it works :-).</p>
<p><img src="/assets/savis-erd.png" alt="Simple ERD" /></p>GarageTalks: Taming Kubernetes jobs with Python2019-09-24T00:00:00+02:002019-09-24T00:00:00+02:00http://slawomir.net/2019/09/24/taming-k8s-jobs-with-python<p>Recently Nokia launched a fancy new <a href="https://www.meetup.com/pl-PL/GarageTalks-tech-attitudes/events/264547887/">Garage Talks</a> meetup. It gravitates around cloud technologies, development tools, architecture etc. I gave a talk last time about Kubernetes jobs and how you can create and control them using the official Python client. Slides can be found <a href="https://slawomir.net/p/taming-k8s-jobs-with-python">here</a>.</p>HexIT Escape Room for IT geeks - escape if you can (Wrocław, Poland)2018-10-03T15:55:00+02:002018-10-03T15:55:00+02:00http://slawomir.net/2018/10/03/hexit-escape-room-escape-if-you-can<a href="https://2.bp.blogspot.com/-cOTA7d1VD3Q/W7TIftmcHlI/AAAAAAAAAhs/W3gwT3IoSiQJl0kyVYWyYPuvc5cWOhzyQCLcBGAs/s1600/hexit1.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="517" data-original-width="775" height="212" src="https://2.bp.blogspot.com/-cOTA7d1VD3Q/W7TIftmcHlI/AAAAAAAAAhs/W3gwT3IoSiQJl0kyVYWyYPuvc5cWOhzyQCLcBGAs/s320/hexit1.jpg" width="320" /></a>I'm glad to announce that we have launched an escape room that targets IT people (developers, testers etc). It has been running for one month now and about 30 teams (3-4 people each) have already enjoyed it.<br /><br />Treat yourself and pay us a visit! Based on the reactions of other teams I can guarantee a remarkable experience. You don't need to be a "hackerman" to complete the room, but if you are you will do so faster ;-). 
Teams can be mixed too (but at least one person with basic programming skills is rather required).<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-VTK_kUVsFaU/W7TIfhRytTI/AAAAAAAAAhw/toWob_ITl0cMB_43zabxLGxYl7-FmS9SgCLcBGAs/s1600/hexit2.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="1440" data-original-width="960" height="320" src="https://1.bp.blogspot.com/-VTK_kUVsFaU/W7TIfhRytTI/AAAAAAAAAhw/toWob_ITl0cMB_43zabxLGxYl7-FmS9SgCLcBGAs/s320/hexit2.jpg" width="213" /></a></div><br /><b>Room location & partnership with Let Me Out.</b><br /><br />ul. Bernardyńska 4 (close to Galeria Dominikańska), IInd floor (map below)<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><img border="0" data-original-height="658" data-original-width="981" height="214" src="https://2.bp.blogspot.com/-HHPAI8r1-YM/W7TJlUwHgSI/AAAAAAAAAh4/ZSzYJ7OzGKIzMKAD5eLcLBAqpvdsBcyfACLcBGAs/s320/hexit-map.png" style="margin-left: auto; margin-right: auto;" width="320" /></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.google.com/maps/place/Bernardy%C5%84ska+4,+11-400+Wroc%C5%82aw,+Polska/@51.1101787,17.0294945,15.5z/data=!4m5!3m4!1s0x470fc277e01358ed:0x494fb3fbd9f272a!8m2!3d51.1104811!4d17.0413637" target="_blank">Link to Google Maps</a></td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"></div><br /><br /><span style="font-size: large;"><span style="background-color: yellow;">Book here: <a href="http://letmeout.pl/">letmeout.pl</a><span style="background-color: white;"> (select Wrocław)</span></span></span><br /><br />I'm the author and creator of the room but I'm not running the business. 
The company that operates the room is Let Me Out and has excellent portfolio of other escape rooms in lot of Polish cities and Brussels.<br /><br /><b>Room theme</b><br />It goes like this: <br /><i>Another country is trying to become an atomic superpower through the development of nuclear weapons, which consequently results in the destabilization of the region and the escalation of the international conflict on an unprecedented scale. The world is on the verge of the outbreak of World War III. The only salvation is to infect the secret plantation of uranium treatment with a computer virus. Will a group of programmers be able to prevent nuclear war in 90 minutes?</i> <br /><br />I received some suggestions that the room itself should be marketed as "an ordinary escape room with extra IT riddles". And this is actually what I wanted to build. Not to give a desk, PC and Jira for the players, but give them a nice mix of good background story with many different IT riddles.<br />Solve the riddles while saving the world! :-)<br /><br /><b>Easter Egg</b><br />In the room there are some easter eggs. One of them will let you listen to some famous song. The code is what comes out of `1900 + 80 + 9` and you need to properly enter it. You'll know where once you're there ;)<br /><br /><b>Room name</b><br />Funny fact about the name: it incorporates four things:<br /><ul><li>Hex, as a reference to hexadecimal numbers (you'll see some of them ;-)</li><li>Hex, as a uranium industry jargon name for <a href="https://en.wikipedia.org/wiki/Uranium_hexafluoride" target="_blank">Uranium Hexafluoride</a></li><li>Exit, related to Escape word</li><li>IT - information technology</li></ul>bjkI'm glad to announce that we have launched an escape room that targets IT people (developers, testers etc). So far it works well for one month and about 30 teams (3-4 people) have already enjoyed it.Please yourself and pay us a visit! 
<br /><b>Global app variables in connexion & aiohttp</b> (2018-08-17, http://slawomir.net/2018/08/17/global-app-variables-in-connexion)<br /><br />tl;dr: use <span style="font-family: "Courier New", Courier, monospace;">pass_context_arg_name</span> and <span style="font-family: "Courier New", Courier, monospace;">api.subapp</span><br /><br />Nowadays microservice architecture seems to be the default way distributed applications are built. Also, people have started to treat APIs as first-class citizens. Hence, it's no surprise that projects like <a href="https://github.com/OAI/OpenAPI-Specification" target="_blank">Swagger/OpenAPI</a> are gaining popularity on a daily basis.<br /><br />One of the Python OpenAPI implementations that I discovered recently is <a href="https://github.com/zalando/connexion/" target="_blank">Connexion</a>. The advantages of using OpenAPI are obvious: e.g. you can decouple the endpoint schema from the app logic and have a single place where the whole API is described. Even the fact that there's a Swagger UI for API users can be quite beneficial.<br /><br />In the past I've looked at different frameworks like django-rest, but nothing seemed as simple as Connexion. I decided to play with it right after discovering that the guys from Zalando had added support for aiohttp (an asynchronous HTTP server) - the framework we use extensively in our projects.<br /><br />So what's the problem? What is this post about? Although Connexion is great, it is undocumented (or my DuckDuckGo-fu sucks and this is in fact just not well documented) how to glue it together with the way global variables are handled in aiohttp - using the app as a container for globals.
Consider the following snippet:<br /><br /><span style="font-family: "Courier New", Courier, monospace;"><span style="color: #cc0000;">async</span><b> </b><span style="color: #cc0000;">def</span> handler(request):</span><br /><span style="font-family: "Courier New", Courier, monospace;">    <span style="color: blue;"># this is how aiohttp creators recommend to access global variables</span></span><br /><span style="font-family: "Courier New", Courier, monospace;">    <span style="color: blue;"># e.g. database handle</span></span><br /><span style="font-family: "Courier New", Courier, monospace;">    request.app[<span style="color: magenta;">'redis_con'</span>].incr(<span style="color: magenta;">'visits'</span>)</span><br /><span style="font-family: "Courier New", Courier, monospace;">    <span style="color: #cc0000;">return</span> web.Response(body=<span style="color: magenta;">b'hello'</span>)</span><br /><br />Nothing much more than an ordinary aiohttp handler that uses the <span style="font-family: "Courier New", Courier, monospace;">redis_con</span> global. Unfortunately, using globals with Connexion is not that straightforward. An example of how Connexion handlers look (the following comes from the Connexion docs):<br /><br /><span style="font-family: "Courier New", Courier, monospace;"><span style="color: #cc0000;">def</span> example(name: str) -> str:</span><br /><span style="font-family: "Courier New", Courier, monospace;">    <span style="color: #cc0000;">return</span> <span style="color: magenta;">'Hello {name}'</span>.format(name=name)</span><br /><br />There's no request parameter! It took me some time to find out how to let Connexion pass the request (the aiohttp context) to handlers.
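<p>The reason the plain handler never sees the request is that Connexion builds the call from the handler's signature and passes only the parameters the function declares. A toy sketch of that dispatch idea in pure Python (this is an illustration, not Connexion's actual code; <code>dispatch</code> and the sample handlers are made up):</p>

```python
import inspect

def dispatch(handler, params, context, context_arg_name="request"):
    """Call handler with its declared params; inject the context only if
    the signature accepts it (by the configured name or via **kwargs)."""
    sig = inspect.signature(handler)
    kwargs = {k: v for k, v in params.items() if k in sig.parameters}
    accepts_var_kw = any(p.kind is inspect.Parameter.VAR_KEYWORD
                         for p in sig.parameters.values())
    if context_arg_name in sig.parameters or accepts_var_kw:
        kwargs[context_arg_name] = context
    return handler(**kwargs)

def example(name: str) -> str:            # never sees the context
    return 'Hello {name}'.format(name=name)

def handler(name: str, **kwargs) -> str:  # receives it via **kwargs
    return '{} (app={})'.format(name, kwargs['request']['app'])

print(dispatch(example, {'name': 'world'}, {'app': 'globals'}))  # -> Hello world
```

<p>With <code>pass_context_arg_name='request'</code> the real framework behaves analogously: the context is injected only when the handler can accept it, e.g. via <code>**kwargs</code>.</p>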
I had to dig into the source code to figure out the following:<br /><br /><span style="font-family: "Courier New", Courier, monospace;"><span style="color: #cc0000;">def</span> start(redis_con):</span><br /><span style="font-family: "Courier New", Courier, monospace;">    app = connexion.AioHttpApp(__name__, specification_dir=<span style="color: magenta;">'swagger/'</span>)<br />    api = app.add_api(<span style="color: magenta;">'api.yml'</span>, pass_context_arg_name=<span style="color: magenta;">'request'</span>)<br />    api.subapp[<span style="color: magenta;">'redis_con'</span>] = redis_con<br />    app.run()</span><br /><br />We're passing the <span style="font-family: "Courier New", Courier, monospace;">pass_context_arg_name</span> parameter, and it turns out that for aiohttp the context is the request. The unintuitive thing is the subapp part. We need to use it in order to set the global. This part I found in the <span style="font-family: "Courier New", Courier, monospace;">aiohttp_jinja2.setup</span> function. Now, we can use it in handlers like the following:<br /><br /><span style="font-family: "Courier New", Courier, monospace;"><span style="color: #cc0000;">async def</span> handler(*args, **kwargs):</span><br /><span style="font-family: "Courier New", Courier, monospace;">    kwargs[<span style="color: magenta;">'request'</span>].app[<span style="color: magenta;">'redis_con'</span>].incr(<span style="color: magenta;">'visits'</span>)</span><br /><span style="font-family: "Courier New", Courier, monospace;">    <span style="color: #cc0000;">return</span> web.Response(body=<span style="color: magenta;">b'hello'</span>)</span><br /><br />That's all. It seems like an easy thing, but I couldn't find it anywhere online.
<br /><b>Handling multiple identical USB ethernet adapters (Raspberry PI, udev)</b> (2018-07-21, http://slawomir.net/2018/07/21/handling-multiple-identical-usb)<br /><br />You have to build a simple ethernet-connected chain of devices and continuously check that it's healthy. In order to save money and time you decide to replace individual devices (say, Raspberries) with multiple USB ethernet adapters. You buy Chinese ones. What could go wrong?<br /><br />We're building an <a href="https://en.wikipedia.org/wiki/Escape_room" target="_blank">escape room</a>. There are plenty of them in Wrocław but ours is special, because it's dedicated to IT guys. Random people would have a lot of trouble solving even the first riddles. These riddles are supposed to be great fun for tech people.<br /><br />I don't want to spoil what the riddles are. Let's stay with the technical problem I had at hand. Multiple devices need to be accessible in some specific configuration to solve one of the riddles. It made no sense to have these devices if their only purpose was to respond to some ICMP packet (certainly there is an even more low-level solution, but we needed something easy and reliable now). We decided to limit the number of these devices and to attach USB ethernet adapters to each.
My colleague bought some Chinese adapters like the ones in the picture below, and problems emerged immediately.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-3P2R5w94bA8/W1OYBSTzSfI/AAAAAAAAAg0/IyWd2pDG0jAl0bw6jHZkIyZ-UES2DLeywCLcBGAs/s1600/20180717_223254.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1600" data-original-width="1200" height="320" src="https://1.bp.blogspot.com/-3P2R5w94bA8/W1OYBSTzSfI/AAAAAAAAAg0/IyWd2pDG0jAl0bw6jHZkIyZ-UES2DLeywCLcBGAs/s320/20180717_223254.jpg" width="240" /></a></div><br />BTW, a funny fact: the CE marks on some devices (I'm not sure about this one) may not actually be CE marks but "China Export" marks. You can read more about it <a href="https://www.ybw.com/vhf-marine-radio-guide/warning-dont-get-confused-between-the-ce-mark-and-the-china-export-mark-4607" target="_blank">here</a>.<br /><br /><b>Perfect hardware clones!</b><br /><br />So what's the problem? Well... when I plugged in the first adapter I made some configuration changes in Raspbian and was happy that everything worked flawlessly. However, a couple of days later I connected a second adapter to the same device, and that was when the problem surfaced. All of these USB adapters had the same MAC address. To make it even worse, after inspecting what's in /sys, I was sure that all of the USB parameters were also identical. In other words, these devices were perfect clones. The ROM was the same for all of them! And btw, one out of 8 was not working at all.<br /><br />Why is this a problem? It's because if the names are the same, the kernel will rename the network interface to something like rename{number} and there's no reliable way to tell which interface is connected to which cable.
Sadly, they also share the same MAC, so if you connect all adapters to the same switch, funny things will start to happen!<br /><br /><b>U<strike>boot</strike>dev for the rescue</b><br /><br />I'm not that into Linux, but I immediately knew where to look - udev. I was afraid that there wouldn't be a way to differentiate between adapters at the udev level, and I was right.<br /><br />However, a silly solution is possible (maybe not silly; if something is silly but it works, it means it's not silly ;-): differentiate the USB ports rather than the devices themselves.<br /><br />I started to read the documentation and found that you can create rules based on ports, like the following:<br /><br /><span style="font-size: x-small;"><span style="font-family: "courier new" , "courier" , monospace;">SUBSYSTEM=="net", KERNELS=="1-2:1.0", ATTR{address}=="00:e0:4c:53:44:58"</span></span><br /><br />net is the subsystem we want. The USB port must be provided in the KERNELS parameter (the S at the end is both intentional and crucial). By providing the address attribute you may further target only those Chinese adapters you have on the desk.<br /><br />Finding out the USB ports proved to be a slightly tricky task. You can do it using the udevadm utility.<br />I have prepared a diagram for my RPi 3:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-tP85WsmJIt8/W1OZjIDFTII/AAAAAAAAAhA/HYpi3sXIn_8NlpCHCm48tqSbOmTE7GAjACLcBGAs/s1600/rpi-usb-ports.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="186" data-original-width="566" height="131" src="https://4.bp.blogspot.com/-tP85WsmJIt8/W1OZjIDFTII/AAAAAAAAAhA/HYpi3sXIn_8NlpCHCm48tqSbOmTE7GAjACLcBGAs/s400/rpi-usb-ports.png" width="400" /></a></div><br /><br />Please take note that this may be different in your case.
The reason is that it all depends on:<br /><ul><li>hardware revision</li><li>firmware version</li><li>kernel version</li><li>kernel module versions</li></ul>Once we know these USB "addresses" we can write the rules. The rules are below. I'd like to additionally emphasize two things:<br /><ul><li>you can target using <span style="font-size: x-small;"><span style="font-family: "courier new" , "courier" , monospace;">ATTR{address}=="mac-here"</span></span>, but apparently there's no way to change it (<span style="font-size: x-small;"><span style="font-family: "courier new" , "courier" , monospace;">ATTR{address}="new-mac"</span></span> doesn't work)</li><li>changing the MAC address is still possible (e.g. <span style="font-size: x-small;"><span style="font-family: "courier new" , "courier" , monospace;">ifconfig <ifname> hw ether ...</span></span>) and you can even use the name you set, but you must use absolute paths to executables!</li></ul><br /><span style="font-size: x-small;"><span style="font-family: "courier new" , "courier" , monospace;">SUBSYSTEM=="net", KERNELS=="1-1.2:1.0", ATTR{address}=="00:e0:4c:53:44:58", NAME="kabelek1", RUN+="/sbin/ifconfig kabelek1 hw ether 00:e0:4c:00:00:01"<br />SUBSYSTEM=="net", KERNELS=="1-1.4:1.0", ATTR{address}=="00:e0:4c:53:44:58", NAME="kabelek2", RUN+="/sbin/ifconfig kabelek2 hw ether 00:e0:4c:00:00:02"<br />SUBSYSTEM=="net", KERNELS=="1-1.3:1.0", ATTR{address}=="00:e0:4c:53:44:58", NAME="kabelek3", RUN+="/sbin/ifconfig kabelek3 hw ether 00:e0:4c:00:00:03"<br />SUBSYSTEM=="net", KERNELS=="1-1.5:1.0", ATTR{address}=="00:e0:4c:53:44:58", NAME="kabelek4", RUN+="/sbin/ifconfig kabelek4 hw ether 00:e0:4c:00:00:04"</span></span><br /><br />And voilà! You are free to connect a lot of adapters to a single Raspberry.
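<p>To find the KERNELS value for a given interface you can walk its parent devices with <span style="font-family: "courier new" , "courier" , monospace;">udevadm info --attribute-walk --path=/sys/class/net/<ifname></span>. A small sketch of picking the values out of such a walk; the SAMPLE below is abridged and the device paths in it are made up:</p>

```python
import re

# Abridged, hypothetical output of:
#   udevadm info --attribute-walk --path=/sys/class/net/eth1
# (the real walk prints one block per parent device)
SAMPLE = '''
  looking at device '/devices/platform/soc/usb1/1-1/1-1.2/1-1.2:1.0/net/eth1':
    KERNEL=="eth1"
    SUBSYSTEM=="net"
  looking at parent device '/devices/platform/soc/usb1/1-1/1-1.2/1-1.2:1.0':
    KERNELS=="1-1.2:1.0"
    SUBSYSTEMS=="usb"
  looking at parent device '/devices/platform/soc/usb1/1-1/1-1.2':
    KERNELS=="1-1.2"
'''

def kernels_chain(walk_output):
    """Extract KERNELS== values, closest parent first; the first entry
    (the USB interface node, with the :1.0 suffix) is the one that the
    rules above match on."""
    return re.findall(r'KERNELS=="([^"]+)"', walk_output)

print(kernels_chain(SAMPLE))  # -> ['1-1.2:1.0', '1-1.2']
```

<p>Any of the printed parent values would match, but the interface-level one (with the <span style="font-family: "courier new" , "courier" , monospace;">:1.0</span> suffix) pins the rule to a single port, which is the whole point here.</p>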
You still need to maintain the coupling between USB ports and Ethernet cables, and you will also need to do something with the cables ;)<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-DhbDhJjMNoo/W1OYBGL_E8I/AAAAAAAAAgw/8djGnlBeu5MD2ySJzaNMMWeiMdTa52OIwCLcBGAs/s1600/20180717_223305.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="240" src="https://3.bp.blogspot.com/-DhbDhJjMNoo/W1OYBGL_E8I/AAAAAAAAAgw/8djGnlBeu5MD2ySJzaNMMWeiMdTa52OIwCLcBGAs/s320/20180717_223305.jpg" width="320" /></a></div><br />This is how my desk looked when I was figuring things out.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-p7ecq027W4M/W0J50vEJN9I/AAAAAAAAAgg/mPtaMlX2NkkeT0n7dobNA3eLPfWH3ZO1ACLcBGAs/s1600/20180621_002046.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1600" data-original-width="1200" height="400" src="https://2.bp.blogspot.com/-p7ecq027W4M/W0J50vEJN9I/AAAAAAAAAgg/mPtaMlX2NkkeT0n7dobNA3eLPfWH3ZO1ACLcBGAs/s400/20180621_002046.jpg" width="300" /></a></div>To summarize, almost everything can be done, and if something really can't, then you can somehow work around it. However, I believe this trick is just a palliative. The Chinese adapters can backfire at any time, so if you require reliability, you should look for other hardware.
<br /><b>Preconfigured Jenkins cluster in Docker Swarm (proxy, accounts, plugins)</b> (2018-01-03, http://slawomir.net/2018/01/03/preconfigured-jenkins-cluster-in-docker)<br /><br />In recent years a lot of popular technologies have been adapted so that they can run in Docker containers. Our industry even coined a new verb - dockerization. When something is dockerized we usually expect it to behave like a self-contained app that is controlled with either command line switches or environment variables. We also assume that apart from this kind of customization the dockerized thing is zero-conf - it will start right away with no further magic spells.<br /><br />It's just awesome when things work that way. Unfortunately there are exceptions, and Jenkins is one of them.
The problem with Jenkins is that even when you start it from within a container, you still need to:<br /><ul><li>open the configuration wizard (it's a web page) </li><li>prove that you're the right person: pass its challenge by reading some magic file and pasting its content into the configuration wizard</li><li>configure the proxy, if you're behind one</li><li>select plugins to be installed during initialization</li><li>set up an admin account </li></ul>Pretty bad. It resembles an installation wizard like in Windows. Phew. A couple of weeks ago I was trying to check out how well Jenkins would solve one of our data transformation (ETL) problems and was unsure how many times it would be deployed. Hence I needed to do something about this installation process so that it sucks less. All of the building blocks were already on the table: Terraform, Ansible and Docker Swarm. The missing part was a pre-configured dockerized Jenkins running in the Swarm.<br /><br /><a href="https://4.bp.blogspot.com/-HZ6nEhx-jNE/WkX9YNH32CI/AAAAAAAAAew/5TH6mL0cuEUmXnSV1Sf2px2ODb5ZzJctwCLcBGAs/s1600/satanspbeach.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="1181" data-original-width="767" height="320" src="https://4.bp.blogspot.com/-HZ6nEhx-jNE/WkX9YNH32CI/AAAAAAAAAew/5TH6mL0cuEUmXnSV1Sf2px2ODb5ZzJctwCLcBGAs/s320/satanspbeach.jpg" width="207" /></a>So this post, in a DuckDuckGo-friendly list, explains how to:<br /><ul><li>pre-configure Jenkins with a custom user (admin) account</li><li>pre-configure Jenkins with a proxy</li><li>pre-configure Jenkins with specified plugins</li><li>run Jenkins master and slaves entirely in Docker Swarm with Jenkins' own Swarm plugin for automatic master-slave connection establishment</li><li>allow Jenkins jobs to execute other Docker containers nearby (the daemon's sock trick)</li></ul><br /><br /><br /><br /><br /><span style="font-size: xx-small;"><span style="color: #999999;">
http://www.rustypants.net/wp-content/uploads/2008/10/satanspbeach.jpg</span></span><br /><br /><br /><h4>Abandon all hope, ye who enter here.</h4>I remember that in one of the C projects (<strike>not sure which one it was, but perhaps something from GNU, maybe RMS</strike> update: <a href="https://gist.github.com/danmilon/4719562" target="_blank">it was xterm</a>) there was this comment: "abandon all hope, ye who enter here". <strike>It also mentioned how many people have ignored this warning and tried to refactor something.</strike> I have the same reflections w.r.t. configuring Jenkins without custom Groovy scripts. I was reluctant to learn a new language, but eventually this seemed like the most reasonable way to continue.<br /><br />Of course, all of the following problems can be solved in a troglodyte way too. E.g. you can configure everything by hand, extract the Jenkins home directory, tar-gz it and re-use it. But that brings a couple of other problems. Also, surprisingly, a fresh Jenkins home weighed about 70 MB in my case. I always thought that it's just a bunch of XML files, but perhaps it's not that straightforward. Since primitive solutions didn't work right away, I decided to stop for a while and try to solve the problem "the right way".<br /><br /><h4>System overview & requirements.</h4>The system is simple: there's one master (a brilliant example of a SPOF, but nobody cares, since you're unsure of the project's future) and a number of workers (slaves). We want the workers to register with the master automatically. Unfortunately this is not possible using a plain JNLP solution, because you need to register the worker in the master prior to establishing a link. In theory you could do some <span style="font-family: "courier new" , "courier" , monospace;">curl</span> magic, but fortunately there's a plugin that does it for you - Jenkins Swarm (not to be confused with Docker Swarm, as it has literally nothing to do with it).
The Jenkins Swarm plugin consists of two things: a plugin for the Jenkins master and a Java JAR for the slaves.<br />So we're set up. Jenkins Swarm will take care of auto-connecting the slaves. Now, we must run a dockerized version of these slaves and put them into Docker Swarm. But before we talk about slaves, let's handle the master.<br /><br /><h4>Jenkins master with plugins, proxy, and extra configuration.</h4>Let me paste the Dockerfile and explain it line by line.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>FROM</b></span> jenkins/jenkins:2.89.1-alpine</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>ARG</b></span> proxy</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>ENV</b></span> http_proxy=$proxy https_proxy=$proxy</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>USER</b></span> root</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>RUN</b></span> apk update && apk add python3</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>COPY</b></span> requirements.txt /tmp/requirements.txt</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>RUN</b></span> pip3 install -r /tmp/requirements.txt</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>USER</b></span> jenkins</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family:
"courier new" , "courier" , monospace;"><span style="color: purple;"><b>COPY</b></span> plugins.txt /plugins.txt</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>RUN</b></span> /usr/local/bin/install-plugins.sh swarm:3.6 workflow-aggregator:2.5</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>ENV</b></span> JAVA_OPTS=<span style="color: red;">"-Djenkins.install.runSetupWizard=false"</span></span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>COPY</b></span> security.groovy /usr/share/jenkins/ref/init.groovy.d/security.groovy</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>COPY</b></span> proxy.groovy /usr/share/jenkins/ref/init.groovy.d/proxy.groovy</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>COPY</b></span> executors.groovy /usr/share/jenkins/ref/init.groovy.d/executors.groovy</span><br /><br />We must start with some Jenkins image in order to customize it. In my case that's slim Alpine Linux version 2.89.1. Then there's build argument for the proxy. You can ignore this part if you're not behind one.<br /><br />Before we modify the image, we need to switch to root user. After we're done we should switch it back to jenkins fo better security (if you wonder how to check it without base image Dockerfile, <span style="font-family: "courier new" , "courier" , monospace;">docker history</span> command is your friend). In my case I'm also installing some <span style="font-family: "courier new" , "courier" , monospace;">python3</span> stuff defined in <span style="font-family: "courier new" , "courier" , monospace;">requirements.txt</span> dependency file. 
If you're not willing to add any packages to the system, you can skip this entire part too.<br /><br />Then we approach configuring plugins. In various places on the Internet you can find advice to use <span style="font-family: "courier new" , "courier" , monospace;">/usr/local/bin/plugins.sh</span>, but believe me, you don't want to do this, as it installs plugins without their dependencies. The newer <span style="font-family: "courier new" , "courier" , monospace;">install-plugins.sh</span> script takes care of dependencies for you. In our case we're installing two plugins. You might want to install just the essential one - the swarm plugin.<br /><br />Now, four nonstandard lines. I believe that setting <span style="font-family: "courier new" , "courier" , monospace;">runSetupWizard</span> to <span style="font-family: "courier new" , "courier" , monospace;">false</span> is self-explanatory. The remaining lines are there for account setup, proxy configuration and executors configuration.<br /><br />Let's start with setting up the admin account. Groovy, here we go!
<br /><br /><span style="color: magenta;"><b><span style="font-family: "courier new" , "courier" , monospace;">#!groovy</span></b></span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>import</b></span> jenkins.model.*</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>import</b></span> hudson.security.*</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>import</b></span> jenkins.security.s2m.AdminWhitelistRule</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>def</b></span> instance = Jenkins.getInstance()</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>def</b></span> user = <span style="color: purple;"><b>new</b></span> File(<span style="color: red;">"/run/secrets/jenkinsUser"</span>).text.trim()</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>def</b></span> pass = <span style="color: purple;"><b>new</b></span> File(<span style="color: red;">"/run/secrets/jenkinsPassword"</span>).text.trim()</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>def</b></span> hudsonRealm = <span style="color: purple;"><b>new</b></span> HudsonPrivateSecurityRealm(<span style="color: magenta;">false</span>)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">hudsonRealm.createAccount(user, pass)</span><br /><span style="font-family: 
"courier new" , "courier" , monospace;">instance.setSecurityRealm(hudsonRealm)</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>def</b></span> strategy = <span style="color: purple;"><b>new</b></span> FullControlOnceLoggedInAuthorizationStrategy()</span><br /><span style="font-family: "courier new" , "courier" , monospace;">instance.setAuthorizationStrategy(strategy)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">instance.save()</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;">Jenkins.instance.getInjector().getInstance(AdminWhitelistRule.class).setMasterKillSwitch(<span style="color: magenta;">false</span>)</span><br /><br />I'm not Groovy expert so don't judge me by the code above. I have started with just knowledge that it runs over JVM :). It's actually looks like nice managed language. The good part is that, as in Python, the code mostly speaks for itself. Hudson Legacy is visible here as well. I won't go into details - if you want to know from where all of this magic comes, pay a visit to <a href="http://javadoc.jenkins.io/" target="_blank">official docs</a>. Don't forget that you can also use infamous Jenkins console. I found Groovy's <span style="font-family: "courier new" , "courier" , monospace;">dump</span> built-in very helpful too.<br />So the above script will actually setup an admin account, but doesn't hardwire anything. 
Both the username and password come from <a href="https://docs.docker.com/engine/swarm/secrets/" target="_blank">Docker Secrets</a>, which let you manage sensitive data in your Swarm cluster nicely.<br /><br />Now, the second script is for the proxy:<br /><br /><span style="color: magenta;"><b><span style="font-family: "courier new" , "courier" , monospace;">#!groovy</span></b></span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>import</b></span> jenkins.model.*</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>import</b></span> hudson.*</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>def</b></span> instance = Jenkins.getInstance()</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>def</b></span> pc = <span style="color: purple;"><b>new</b></span> hudson.ProxyConfiguration(<span style="color: red;">"1.2.3.4"</span>, <span style="color: magenta;">8080</span>, <span style="color: magenta;">null</span>, <span style="color: magenta;">null</span>, <span style="color: red;">"localhost,*.your.intranet.com"</span>);</span><br /><span style="font-family: "courier new" , "courier" , monospace;">instance.proxy = pc;</span><br /><span style="font-family: "courier new" , "courier" , monospace;">instance.save()</span><br /><br />There's some magic here too. It sets up the proxy <span style="font-family: "courier new" , "courier" , monospace;">1.2.3.4:8080</span> but with specified exceptions. Then it modifies the Jenkins instance (which seems to be a singleton).<br /><br />And finally, the executors part. 
I wanted this one so the master is not used as a worker at all.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>import</b></span> jenkins.model.*</span><br /><span style="font-family: "courier new" , "courier" , monospace;">Jenkins.instance.setNumExecutors(<span style="color: magenta;">0</span>)</span><br /><br /><h4>Slaves.</h4>Now, since the master is ready, let's configure the slaves. Their Dockerfile is as follows.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>FROM</b></span> docker:17.03-rc</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>ARG</b></span> proxy</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>ENV</b></span> https_proxy=$proxy http_proxy=$proxy no_proxy=<span style="color: red;">"localhost,*.your.intranet.com"</span></span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>RUN</b></span> apk --update add openjdk8-jre git python3</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>RUN</b></span> wget -O swarm-client.jar http://repo.jenkins-ci.org/releases/org/jenkins-ci/plugins/swarm-client/3.3/swarm-client-3.3.jar</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>ENV</b></span> http_proxy= https_proxy=</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>COPY</b></span> entrypoint.sh /</span><br /><span style="font-family: "courier new" , "courier" , 
monospace;"><span style="color: purple;"><b>RUN</b></span> chmod +x /entrypoint.sh</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>CMD</b></span> [<span style="color: red;">"/entrypoint.sh"</span>]</span><br /><br />This time base image is docker, because we want to have docker installed within this docker container (so this container can spawn other containers). After setting proxies (the part that is not mandatory) we must download Java Runtime Environment version 8 and download swarm-client JAR. I'm using version 3.3 which is accessible through URL as for today.<br />Finally, there's an entrypoint that will execute swarm-client and do all the magic, but it heavily relies on Docker Secret named <span style="font-family: "courier new" , "courier" , monospace;">jenkinsSwarm</span>, which should look like following.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">-master http://master_address:8080 -password jenkinsUser -username jenkinsPassword</span><br /><br /><br />Here master_address must be known to slave machines (e.g. in <span style="font-family: "courier new" , "courier" , monospace;">/etc/hosts</span>, Consul or something). You should also include username and password - the same ones that you share in other Docker Swarm secrets.<br /><br />If you're using Ansible like I do, it's pretty straightforward to utilize variables instead not to hardcode credentials. 
For instance <span style="font-family: "courier new" , "courier" , monospace;">ansible-vault</span> can be used for this.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">entrypoint.sh</span> itself is almost one-liner:<br /><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>mkdir</b></span> /tmp/jenkins</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="color: purple;"><b>java</b></span> -jar swarm-client.jar -labels=docker -executors=1 -fsroot=/tmp/jenkins -name=docker-<span style="color: red;">$(hostname)</span> <span style="color: red;">$(cat /run/secrets/jenkinsSwarm)</span></span><br /><br />It assumes that it's running in the Swarm and can access <span style="font-family: "courier new" , "courier" , monospace;">/run/secrets/jenkinsSwarm</span> (the line that's pasted above).<br /><br /><h4>Glueing it all together.</h4>Building blocks are already in place. Now it's time to glue everything together. I don't want to go into details here, because this is not primary topic of this blog post. If you're interested in how personally I did everything please let me know in comments, so I will create GitHub repo. Let me however give you some important hints:<br /><ul><li>if you want slave to be able to spawn other containers (on the same host on which the slave is running), you must bind mount <span style="font-family: "courier new" , "courier" , monospace;">docker.sock</span> file, e.g. like this: <span style="color: red;"><span style="font-family: "courier new" , "courier" , monospace;">"/var/run/docker.sock:/var/run/docker.sock"</span></span>. There's more to this, though! Docker daemon will not allow <span style="font-family: "courier new" , "courier" , monospace;">jenkins</span> user to spawn containers, so you must somehow circumvent this problem. 
I'm circumventing this by adding the <span style="font-family: "courier new" , "courier" , monospace;">jenkins</span> user to the docker group, but this works only because there's a 1:1 mapping between the host and the container.</li><li>you should have three secrets in the Docker Swarm cluster: <span style="font-family: "courier new" , "courier" , monospace;">jenkinsUser</span>, <span style="font-family: "courier new" , "courier" , monospace;">jenkinsPassword</span> and <span style="font-family: "courier new" , "courier" , monospace;">jenkinsSwarm</span> with the username, password, and swarm-client.jar arguments respectively</li><li>machines must be able to communicate. For internal JNLP communication, port <span style="color: magenta;"><span style="font-family: "courier new" , "courier" , monospace;">50000/tcp</span></span> must be opened.</li><li>if you set the deployment mode to global in the <span style="font-family: "courier new" , "courier" , monospace;">docker-compose.yml</span> file (if you're using one), then you will have as many slaves as machines in the cluster, which can be nice</li><li>if you're gonna stick to this solution for a longer period of time I recommend thinking about horizontal scaling out and in: it should be as simple as adding/removing machines from the cluster: just one <span style="font-family: "courier new" , "courier" , monospace;">terraform</span> command followed by an <span style="font-family: "courier new" , "courier" , monospace;">ansible-playbook</span> spell.</li></ul><br /><br />Hopefully this post helps you with setting up a Jenkins cluster that simply works. If you'd like to see the code, let me know in the comments!bjkAirflow Docker with Xcom push and pull2017-12-08T22:36:00+01:002017-12-08T22:36:00+01:00http://slawomir.net/2017/12/08/airflow-docker-with-xcom-push-and-pullRecently, in one of the projects I'm working on, we started to research technologies that can be used to design and execute data processing flows. The amount of data to be processed is counted in terabytes, hence we were aiming at solutions that can be deployed in the cloud. 
Solutions from the Apache umbrella like Hadoop, Spark, or Flink were on the table from the very beginning, but we also looked at others like Luigi or Airflow, because our use case was neither MapReducable nor stream-based.<br /><br />Airflow caught our attention and we decided to give it a shot just to see if we could create a PoC with it*. In order to execute the PoC faster rather than slower, we planned to provision a Swarm cluster for it.<br /><br />In Airflow you can find a couple of so-called operators that allow you to execute actions. There are operators for Bash or Python, but you can also find something for e.g. Hive. Fortunately there is also a Docker operator for us.<br /><br /><b>Local PoC</b><br />The PoC started on my laptop, not in the cluster. Thankfully, DockerOperator allows you to pass the URL of the docker daemon, so moving from laptop to cluster is close to just changing one parameter. Nice! <br /><br />If you want to run the Airflow server locally from inside a container, have it running as non-root (you should!) and bind docker.sock from the host into the container, you must create a docker group in the container that mirrors the docker group on your host and then add e.g. the airflow user to this group. That does the trick...<br /><br />So just running DockerOperator is not black magic. However, if your containers need to exchange data it gets a little bit more tricky.<br /><br /><b>Xcom push/pull</b><br />The push part is simple and documented. Just set the <span style="font-family: "Courier New", Courier, monospace;">xcom_push</span> parameter to <span style="font-family: "Courier New", Courier, monospace;">True</span> and the last line of the container's stdout will be published by Airflow as if it had been pushed programmatically. It seems this is the natural Airflow way.<br /><br />Pull is not that obvious. Perhaps because it's not documented. You can't just read stdin or something. 
The way to do this involves connecting two dots:<br /><ul><li>the command parameter can be Jinja2-templated</li><li>one of the macros allows you to do xcom_pull </li></ul>So you need to prepare your containers in a special way so they can pull/push. Let's start with a container that pushes something:<br /><br /><span style="font-family: "Courier New", Courier, monospace;"><span style="color: purple;">FROM</span> debian<br /><span style="color: purple;">ENTRYPOINT</span> echo <span style="color: red;">'{"i_am_pushing": "json"}'</span></span><br /><br />Simple enough. Now the pulling container:<br /><br /><span style="font-family: "Courier New", Courier, monospace;"><span style="color: purple;">FROM</span> debian<br /><span style="color: purple;">COPY</span> ./entrypoint /<br /><span style="color: purple;">ENTRYPOINT</span> [<span style="color: red;">"/entrypoint"</span>]</span><br /><br />The entrypoint script can be whatever you like and will get the JSON as <span style="font-family: "Courier New", Courier, monospace;">$1</span>. A crucial (and also easy to miss) requirement for this to work is that <span style="font-family: "Courier New", Courier, monospace;">ENTRYPOINT</span> must use the exec form. Yes, there are two forms of <span style="font-family: "Courier New", Courier, monospace;">ENTRYPOINT</span>. If you use the one without the array, then parameters will not be passed to the container!<br /><br />Finally, you can glue things together and you're done. The <span style="font-family: "Courier New", Courier, monospace;">ti</span> macro allows us to get the data pushed by another task. 
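Before moving on, a side note on that exec-form pitfall: it can be reproduced without Docker at all. A shell-form ENTRYPOINT is effectively wrapped by Docker as <span style="font-family: "Courier New", Courier, monospace;">/bin/sh -c "command"</span>, and any appended arguments only populate <span style="font-family: "Courier New", Courier, monospace;">$0</span>, <span style="font-family: "Courier New", Courier, monospace;">$1</span>, ... which the command never references. A minimal sketch, assuming a POSIX <span style="font-family: "Courier New", Courier, monospace;">/bin/sh</span> and <span style="font-family: "Courier New", Courier, monospace;">/bin/echo</span> are available:

```python
import subprocess

# Shell form: ENTRYPOINT echo hello  ->  Docker runs /bin/sh -c "echo hello" <args...>
# The extra argument only fills $0, which "echo hello" never references, so it is lost.
shell_form = subprocess.run(["/bin/sh", "-c", "echo hello", "dropped-arg"],
                            capture_output=True, text=True).stdout.strip()

# Exec form: ENTRYPOINT ["echo", "hello"]  ->  args are appended straight to argv.
exec_form = subprocess.run(["/bin/echo", "hello", "visible-arg"],
                           capture_output=True, text=True).stdout.strip()

print(shell_form)  # hello
print(exec_form)   # hello visible-arg
```

This is exactly why the pulling container above must use the array form: the rendered xcom value is appended as an argument, and only the exec form lets it reach the entrypoint.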
<span style="font-family: "Courier New", Courier, monospace;">ti</span> stands for <span style="font-family: "Courier New", Courier, monospace;">task_instance</span>.<br /><br /><span style="font-family: "Courier New", Courier, monospace;">dag = DAG(<span style="color: red;">'docker'</span>, default_args=default_args, schedule_interval=timedelta(<span style="color: purple;">1</span>))<br /><br />t1 = DockerOperator(task_id=<span style="color: red;">'docker_1'</span>, dag=dag, image=<span style="color: red;">'docker_1'</span>, xcom_push=<span style="color: purple;">True</span>)<br /><br />t2 = DockerOperator(task_id=<span style="color: red;">'docker_2'</span>, dag=dag, image=<span style="color: red;">'docker_2'</span>, command=<span style="color: red;">'{{ ti.xcom_pull(task_ids="docker_1") }}'</span>)<br /><br />t2.set_upstream(t1)</span><br /><br /><br /><b>Conclusion</b><br />Docker can be used in Airflow along with Xcom push/pull functionality. It isn't very convenient and is not well documented I would say, but at least it works. <br /><br />If time permits I'm going to create PR for documenting pull op. I don't know how it works out, because in Airflow GH project there are 237 PRs now and some of them are there since May 2016!<br /><br /><br />* the funny thing is that we considered Jenkins too! ;-)bjkRecently, in one projects I'm working on, we started to research technologies that can be used to design and execute data processing flows. Amount of data to be processed is counted in terabytes, hence we were aiming at solutions that can be deployed in the cloud. Solutions from Apache umbrella like Hadoop, Spark, or Flink were at the table from the very beginning, but we also looked at others like Luigi or Airflow, because our use case was neither MapReducable nor stream-based.Airflow caught our attention and we decided to give it a shot just to see if we can create PoC using it*. 
In order to execute PoC faster rather than slower, we planned to provision Swarm cluster for this.In the Airflow you can find couple of so-called operators that allow you to execute actions. There are operators for Bash or Python, but you can also find something for e.g. Hive. Fortunately there is also Docker operator for us.Local PoCPoC started on my laptop and not in the cluster. Thankfully, DockerOperator allows you to pass URL to docker daemon, so moving from laptop to cluster is close to just changing one parameter. Nice! If you want to run Airflow server locally from inside container, and have it running as non-root (you should!) and you bind docker.sock from host to the container, you must create docker group in the container that mirrors docker group on your host and then add e.g. airflow user to this group. That does the trick...So just running DockerOperator is not black magic. However, if your containers need to exchange data it starts to be a little bit more tricky.Xcom push/pullThe push part is simple and documented. Just set xcom_push parameter to True and last line of container stdout will be published by Airflow as it was pushed programatically. It looks that this is natural Airflow way.Pull is not that obvious. Perhaps because it's not documented. You can't read stdin or something. The way to do this involves joining two dots:command parameter can be Jinja2-templatedone of the macros allows you to do xcom_pull So you need to prepare your containers in a special way so they can pull/push. Let's start with a container that pushes something:FROM debianENTRYPOINT echo '{"i_am_pushing": "json"}'Simple enough. Now pulling container:FROM debianCOPY ./entrypoint /ENTRYPOINT ["/entrypoint"]Entrypoint script can be whatever and will get the JSON as $1. Crucial (and also easy to miss) thing that is required for it to work is that ENTRYPOINT must use exec form. Yes, there are two forms of ENTRYPOINT. 
If you use the one without array, then parameters will not be passed to the container!Finally, you can glue things together and you're done. The ti macro allows us to get data pushed by other task. ti stands for task_instance.dag = DAG('docker', default_args=default_args, schedule_interval=timedelta(1))t1 = DockerOperator(task_id='docker_1', dag=dag, image='docker_1', xcom_push=True)t2 = DockerOperator(task_id='docker_2', dag=dag, image='docker_2', command='{{ ti.xcom_pull(task_ids="docker_1") }}')t2.set_upstream(t1)ConclusionDocker can be used in Airflow along with Xcom push/pull functionality. It isn't very convenient and is not well documented I would say, but at least it works. If time permits I'm going to create PR for documenting pull op. I don't know how it works out, because in Airflow GH project there are 237 PRs now and some of them are there since May 2016!* the funny thing is that we considered Jenkins too! ;-)Tests stability S09E11 (Docker, Selenium)2017-11-30T02:47:00+01:002017-11-30T02:47:00+01:00http://slawomir.net/2017/11/30/tests-stability-s09e11-docker-selenium<br />If you're experienced in setting up automated testing with Selenium and Docker you'll perhaps agree with me that it's not the most stable thing in the world. Actually it's far far away from any stable island - right in the middle of "the sea of instability".<br /><br />When you think about failures in automated testing and how they develop when the system is growing it can resemble drugs. Seriously. When you start, occasional failures are ignored. You close your eyes and click "Retry". Innocent. But after some time it snowballs into a problem. And you find yourself with a blind fold put on but you can't remember buying it.<br /><br />This post is small story how in one of small projects we started with occasional failures and ended up with... well... you'll see. 
Read on ;).<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-81RoPbSc1WU/Wh9fOozuFJI/AAAAAAAAAeM/nZkiiBwKzGgNz8J1Ql_O5il4QG09h_T8gCLcBGAs/s1600/2881603057_820af9d26a.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="278" data-original-width="318" src="https://4.bp.blogspot.com/-81RoPbSc1WU/Wh9fOozuFJI/AAAAAAAAAeM/nZkiiBwKzGgNz8J1Ql_O5il4QG09h_T8gCLcBGAs/s1600/2881603057_820af9d26a.jpg" /></a></div><br /><br />For the past couple of months I kept thinking that "others have worse setups and live", but today it all culminated: I achieved the fourth degree of density and decided to stop being quiet.<br /><br /><b>Disclaimer</b><br />In the middle of this post you might start to think that our environment is simply broken. That's damn right. The cloud in which we're running is not very stable. Sometimes it behaves like it's sulking. There are problems with proxies too. And finally we add Docker and Selenium to the mixture. I think a testimonial from one of our engineers sums it all up:<br /><blockquote class="tr_bq">if retry didn’t fix it for the 10<sup>th</sup> time, then there’s definitely something wrong</blockquote>And one more thing must be noted as well. The project I'm referring to is just a side one.
It's an attempt to innovate some process, unsupported by the business whatsoever.<br /><br /><b>The triggers</b><br />I was pressing the "Retry" button yet again on two of the e2e jobs and saw the following.<br /><br /><span style="font-size: small;"><span style="font-family: "courier new" , "courier" , monospace;">// job 1<br />couldn't stat /proc/self/fd/18446744073709551615: stat /proc/self/fd/23: no such file or directory<br /><br />// job 2<br />Service 'frontend' failed to build: readlink /proc/4304/exe: no such file or directory</span></span><br /><br />What the hell is this? We had never seen it before, and now it apparently became commonplace in our CI pipeline (it was the nth retry).<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-S4wr265VQKg/Wh9cme1pQjI/AAAAAAAAAd0/rNcT_ZsLojYNWcDrZlfYPvLgyDP3J9RQACLcBGAs/s1600/Mad_scientist.svg.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1589" data-original-width="1094" height="400" src="https://4.bp.blogspot.com/-S4wr265VQKg/Wh9cme1pQjI/AAAAAAAAAd0/rNcT_ZsLojYNWcDrZlfYPvLgyDP3J9RQACLcBGAs/s400/Mad_scientist.svg.png" width="275" /></a></div><br />So the big number after /fd/ is the 64-bit value of -1. Perhaps something in Selenium uses some function that returns an error and then calls the stat syscall, passing -1 as an argument. The function's return value was not checked!<br />The second error message is most probably related to Docker. Something tries to find where the executable for some PID is. Why?<br /><br />The "Retry" solution did not work this time. Re-deploying the e2e workers didn't help either. I thought that now was the time to get some insight into what is actually happening and how many failures were caused by the unstable environment.<br /><br />Luckily we're running on GitLab, which provides a reasonable API. Read on to see what I've found.
I personally find it hilarious.<br /><br /><b>Insight into failures</b><br />It's extremely easy to make use of the GitLab CI API (thanks, GitLab guys!). I extracted JSON objects for every job in every pipeline recorded in our project and started playing with the data.<br /><br />The first thing I checked was how many failures there are per particular type of test. Names are anonymized a little because I'm unsure whether this is sensitive data or not. Better safe than sorry!<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-LGpOoRQrCmA/Wh9OnGVswlI/AAAAAAAAAdY/-sKo7w-EDeUf8i02u2CfyCJX9S9y3OFVgCLcBGAs/s1600/Figure_1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="872" data-original-width="1600" height="347" src="https://2.bp.blogspot.com/-LGpOoRQrCmA/Wh9OnGVswlI/AAAAAAAAAdY/-sKo7w-EDeUf8i02u2CfyCJX9S9y3OFVgCLcBGAs/s640/Figure_1.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Fig 1: Successful/failed jobs, per job name</td></tr></tbody></table>I knew that some tests were failing often, but these results say that in some cases almost 50% of the jobs fail! Insane! BTW, we recently split some of the long-running e2e test suites into smaller jobs, which is observable in the figure.<br />But one could argue that maybe this is because of bugs in the code. Let's see. In order to tell, we must analyze the data based on commit hashes: how many commits in particular jobs were executed multiple times and finished with different statuses.
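The counting described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the original script, and the job dictionaries are deliberately simplified compared to what the GitLab API actually returns:

```python
from collections import defaultdict

def flaky_stats(jobs):
    """jobs: iterable of dicts with 'commit', 'name' and 'status' keys.

    Returns (number of (commit, job) pairs with at least one success,
             total number of failures among those pairs).
    """
    by_pair = defaultdict(lambda: {'success': 0, 'failed': 0})
    for job in jobs:
        if job['status'] in ('success', 'failed'):
            by_pair[(job['commit'], job['name'])][job['status']] += 1
    # A pair that eventually succeeded but also failed means the code was
    # fine and the environment (or flakiness) caused the failures.
    with_success = {k: v for k, v in by_pair.items() if v['success'] > 0}
    failures = sum(v['failed'] for v in with_success.values())
    return len(with_success), failures

# Toy data in the simplified shape assumed above.
jobs = [
    {'commit': 'd7f43f9c', 'name': 'e2e-7', 'status': 'failed'},
    {'commit': 'd7f43f9c', 'name': 'e2e-7', 'status': 'failed'},
    {'commit': 'd7f43f9c', 'name': 'e2e-7', 'status': 'success'},
    {'commit': 'aaaa0000', 'name': 'unit',  'status': 'success'},
]
pairs, env_failures = flaky_stats(jobs)  # 2 pairs, 2 suspicious failures
```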
In other words: we look for the situations in which the job status varied even without changes in the code.<br /><br />The numbers for our repository are:<br /><ul><li>number of (commit, job) pairs with at least one success: <b>23550</b></li><li>total number of failures for these pairs: <b>1484</b></li></ul><br />In other words, the unstable environment was responsible for at least ~<b>6.30%</b> of the observed failures. It might look like a small number, but if you take into account that a single job can last 45 minutes, it adds up to a lot of wasted time. Especially since failure notifications aren't always handled immediately. I also have a hunch that at some point people started to click "Retry" just to be sure the problem was not with the environment.<br /><br />My top 5 picks among all of these failures are below.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;">hash:job | #tot | success/fail | users clicking "Retry"</span><br /><span style="font-family: "courier new" , "courier" , monospace;">----------------------------------------------------------------</span><br /><span style="font-family: "courier new" , "courier" , monospace;">d7f43f9c:e2e-7 | 19 | ( 1/17) | user-6,user-7,user-5<br />2fcecb7c:e2e-7 | 16 | ( 8/ 8) | user-6,user-7<br />2c34596f:other-1 | 14 | ( 1/13) | user-8<br />525203c6:other-13 | 12 | ( 1/ 8) | user-13,user-11<br />3457fbc5:e2e-6 | 11 | ( 2/ 9) | user-14</span><br /><br />So, for instance - commit d7f43f9c failed on job e2e-7 17 times, and three distinct users tried to make it pass by clicking the "Retry" button over and over. And finally they made it! Ridiculous, isn't it?<br /><br />And speaking of time, I've also checked jobs that ran for an enormous amount of time.
Winners are:<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">job:status | time (hours)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">---------------------------------</span><br /><span style="font-family: "courier new" , "courier" , monospace;">other-2:failed | 167.30<br />other-8:canceled | 118.89<br />other-4:canceled | 27.19<br />e2e-7:success | 26.12<br />other-1:failed | 26.01</span><br /><br />Perhaps these are just outliers. Histograms would give better insight. But even if they are outliers, they're crazy outliers.<br /><br /><br />I have also attempted to detect the reason for each failure, but this is a more complex problem to solve. It requires parsing logs and guessing which line was the first one indicating an error condition. Then comes the second guess - whether the problem originated from the environment or from the code.<br />Maybe such a task could be handled by the (in)famous machine learning. Actually, there are more items that could be achieved with ML support. The simplest examples are:<ul><li>estimating whether the job will fail</li><ul><li>also, providing the reason for the failure</li><li>if the failure originated from a faulty environment, what exactly was it? </li></ul><li>estimating the time for the pipeline to finish</li><li>auto-retry in case of an env-related failure</li></ul><br /><b>Conclusions</b><br />Apparently my e2e test environment has been much more unstable than I ever thought. The lesson learned is that if you get used to solving problems by retrying, you lose the sense of how much trouble you're in.<br /><br />As with any other engineering problem, you first need to gather data and then decide what to do next. Based on the numbers I have now, I'm planning to implement some ideas to make life easier.<br /><br />While analyzing the data I had moments when I couldn't stop laughing to myself. But the reality is sad. It started with occasional failures and ended with a continuous problem.
And we weren't doing much about it. The problem was not that we were effed in the ass. The problem was that we started to arrange our place there. Insights will help us get out.<br /><br />Share your ideas in comments. If we bootstrap discussion I'll do my best to share the code I have in GitHub.C++: on the dollar sign2017-04-26T00:15:00+02:002017-04-26T00:15:00+02:00http://slawomir.net/2017/04/26/c-on-dollar-signIn most programming languages there are sane rules that specify what can be an identifier and what cannot. Most of the time it's even intuitive - it's just something that matches <span style="font-family: "Courier New",Courier,monospace;">[_a-zA-Z][a-zA-Z0-9]*</span>.
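As a quick illustration of that common rule, sketched in Python (note that most languages also allow '_' in the continuation part, so the pattern below includes it):

```python
import re

# The usual identifier rule: letter or underscore first, then letters,
# digits or underscores.
IDENT = re.compile(r'^[_a-zA-Z][_a-zA-Z0-9]*$')

assert IDENT.match('foo_bar')
assert IDENT.match('_x42')
assert not IDENT.match('4chan')   # cannot start with a digit
assert not IDENT.match('$this')   # '$' is rejected by the common rule
```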
There are languages that allow more (e.g. $ in PHP/JS, or <a href="http://www.originlab.com/doc/LabTalk/guide/String-registers" target="_blank">% in LabTalk</a>). How about C++? The answer to this question may be a little surprising.<br /><br />Almost a year ago I had a little argument with a friend of mine about whether the dollar sign is allowed within C++ identifiers. In other words, it was about whether e.g. <span style="background-color: #eeeeee; font-family: "courier new" , "courier" , monospace;">int $this = 1;</span> is legal C++ or not.<br />Basically, I was stating that it's not possible. On the other hand, my friend was recalling some friend of his who mentioned that dollars are fine.<br /><br />The first line of defense is of course the nearest compiler. I decided to fire one up and simply check what happens if I compile the following fragment of code.<br /><br /><pre id="vimCodeElement" style="background-color: seashell; font-size: 13px; white-space: pre-wrap;">auto $foo() {
    int $bar = 1;
    return $bar;
}</pre><br />At the time I had gcc-4.9.3 installed on my system (a prehistoric version, I know ;-).
For the record, the command was like this: <span style="background-color: #f3f3f3; font-family: "courier new" , "courier" , monospace;">g++ dollar.cpp -std=c++1y -c -Wall -Wextra -Werror</span>.<br /><br />And to my surprise... it compiled without a single complaint. Moreover, clang and MSVC gulped it down without complaining as well. Well, Sławek - I said to myself - even if you've been mastering something for years, there's still much to surprise you. BTW such a conclusion puts titles like the following in a much funnier light.<br /><br /><div style="text-align: center;"><a href="http://2.bp.blogspot.com/-Etlaj1acncA/VmoA1CV9RVI/AAAAAAAAATs/b0lkpzmhllk/s1600/41Cozm-LkhL._SX333_BO1%252C204%252C203%252C200_.jpg" imageanchor="1"><img border="0" height="200" src="https://2.bp.blogspot.com/-Etlaj1acncA/VmoA1CV9RVI/AAAAAAAAATs/b0lkpzmhllk/s200/41Cozm-LkhL._SX333_BO1%252C204%252C203%252C200_.jpg" width="134" /></a></div><br />It was a normal office day and we had other work to get done, so I reluctantly accepted this as just another dark corner. After a couple of hours I forgot about the situation and let it resurface... a couple of weeks later.<br /><br />So, fast forward a couple of weeks. I was preparing something related to C++ and accidentally found a reference to the dollar sign in the GCC documentation. It was a nice feeling, because I knew I would fill this hole in my knowledge in a matter of minutes. So what was the reason compilers were happily accepting dollar signs?<br />Let me put here an excerpt from the GCC documentation, which speaks for itself :)<br /><blockquote class="tr_bq"><blockquote class="tr_bq"><i>GCC allows the ‘<samp>$</samp>’ character in identifiers as an extension for most targets. This is true regardless of the <samp>std=</samp> switch, since this extension cannot conflict with standards-conforming programs.
When preprocessing assembler, however, dollars are not identifier characters by default.</i><br /><i>Currently the targets that by default do not permit ‘<samp>$</samp>’ are AVR, IP2K, MMIX, MIPS Irix 3, ARM aout, and PowerPC targets for the AIX operating system.</i><br /><i>You can override the default with <samp>-fdollars-in-identifiers</samp> or <samp>fno-dollars-in-identifiers</samp>. See <a href="https://gcc.gnu.org/onlinedocs/cpp/fdollars-in-identifiers.html#fdollars-in-identifiers">fdollars-in-identifiers</a>.</i></blockquote></blockquote><br />I think the three most important things are:<br /><ol><li>This doesn't work in macros.</li><li>It doesn't seem to be correlated with the -std switch.</li><li>Some architectures do not permit it at all.</li></ol>What got me thinking is this list of architectures. It took me a couple of minutes to find out that e.g. the assembler for ARM doesn't allow the dollar sign. So any assembly code generated by GCC for ARM would not assemble if a dollar sign was used. That's a plausible explanation of why GCC doesn't allow this character for all architectures. It doesn't explain why compilers allow it for the others, though.<br /><br />GCC could theoretically mitigate the problem on particular architectures by replacing $ signs with some other character, but then a bunch of other problems would appear: possible name conflicts, name mangling/demangling yielding incorrect values, and finally it wouldn't be possible to export such "changed" symbols from a library. In other words: disaster.<br /><br />What about the standard?<br /><br />After thinking about it for a minute I had a strong need to see what exactly an identifier means. So I opened <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3797.pdf" target="_blank">N3797</a> and quickly found the section I was looking for, namely (surprise-surprise) <i>2.11 Identifiers</i>.
So what does this section say?<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-7zMTrtfoHtM/VmoDbUry6AI/AAAAAAAAAT0/bhUAD-ArE8c/s1600/identifiers.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://4.bp.blogspot.com/-7zMTrtfoHtM/VmoDbUry6AI/AAAAAAAAAT0/bhUAD-ArE8c/s400/identifiers.png" width="321" /></a></div><br /><br />Right after formal definition there is an explanation which refers to sections E.1 and E.2. But that's not important here. There is one more thing that appears in the formal definition and it's extremely easy to miss this one. It's "other implementation-defined characters". What does it mean? Yup - the compiler is allowed to allow any other character to be used within identifiers at will.<br /><br />P.s. surprisingly cppcheck 1.71 doesn't report $ sign in identifiers as a problem at all.Getting all parent directories of a path2017-01-06T19:28:00+01:002017-01-06T19:28:00+01:00http://slawomir.net/2017/01/06/getting-all-parent-directories-of-file<span style="color: #e06666;">edit: reddit updates</span> <br /><br />Few minutes ago I needed to solve trivial problem of getting all parent directories of a path. It's very easy to do it imperatively, but it would simply not satisfy me.
Hence, I challenged myself to do it declaratively in Python.<br /><br />The problem is simple, but let me put an example on the table, so it's even easier to imagine what we are talking about.<br /><br />Given some path, e.g.<br /><br /><span style="color: purple;"><span style="font-family: "courier new" , "courier" , monospace;">/home/szborows/code/the-best-project-in-the-world</span></span><br /><br />you want to get the following list of parents:<br /><br /><span style="color: purple;"><span style="font-family: "courier new" , "courier" , monospace;">/home/szborows/code</span></span><br /><span style="color: purple;"><span style="font-family: "courier new" , "courier" , monospace;">/home/szborows</span></span><br /><span style="color: purple;"><span style="font-family: "courier new" , "courier" , monospace;">/home</span></span><br /><span style="color: purple;"><span style="font-family: "courier new" , "courier" , monospace;">/</span></span><br /><br />It's trivial to do this using <span style="font-family: "courier new" , "courier" , monospace;">split</span> and then a for loop. How to make it more declarative?<br />Thinking more mathematically (mathematicians will perhaps cry out to heaven for vengeance after reading on, but let me at least try...), we simply want all subsets of some ordered set S that form a prefix w.r.t. S. So we can simply generate pairs of numbers <span style="color: purple;"><span style="font-family: "courier new" , "courier" , monospace;">(1, y)</span></span>, representing all prefixes, where y belongs to <span style="color: purple;"><span style="font-family: "courier new" , "courier" , monospace;">[1, len S)</span></span><span style="font-family: inherit;">. 
We can actually ignore this constant 1 and just operate on numbers.</span><br /><span style="font-family: inherit;">In Python, to generate numbers starting from len(path) and going down, we can simply utilize <span style="color: purple;"><span style="font-family: "courier new" , "courier" , monospace;">range()</span></span> and <span style="color: purple;"><span style="font-family: "courier new" , "courier" , monospace;">[::-1]</span></span> (this reverses collections; it's an idiom). Then <span style="color: purple;"><span style="font-family: "courier new" , "courier" , monospace;">join()</span></span> can be used on the split path, but with slicing from 1 to y. That's it. And now a demonstration:</span><br /><br /><span style="color: purple;"><span style="font-family: "courier new" , "courier" , monospace;">>>> path = '/a/b/c/d' </span></span><br /><span style="color: purple;"><span style="font-family: "courier new" , "courier" , monospace;">>>> <b>['/' + '/'.join(path.split('/')[1:l]) for l in range(len(path.split('/')))[::-1] if l]</b></span></span><br /><span style="color: purple;"><span style="font-family: "courier new" , "courier" , monospace;">['/a/b/c', '/a/b', '/a', '/']</span></span><br /><br />But what about performance? Which one will be faster - the imperative or the declarative approach?
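(A side note before the benchmarks: since Python 3.4 the standard library can also do this declaratively - the parents property of pathlib's pure paths yields exactly this list.)

```python
from pathlib import PurePosixPath

# .parents walks up the directory tree, root included.
parents = [str(p) for p in PurePosixPath('/a/b/c/d').parents]
print(parents)  # → ['/a/b/c', '/a/b', '/a', '/']
```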
Intuition suggests that the imperative version will win, but let's check.<br /><br />In the picture below you can see timeit (n=1000000) results for my machine (i5-6200U, Python 3.5.2+) for three paths:<br /><br /><pre><code>short_path = '/lel'
regular_path = '/jakie/jest/srednie/zagniezdzenie?'
long_path = '/z/lekka/dlugasna/sciezka/co/by/pierdzielnik/mial/troche/roboty'</code></pre><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-r4QOnEA3dzI/WHF7305erQI/AAAAAAAAAbA/rCCsLEj9r9cwgfC69p3jhLuVaDtZIkhfQCLcB/s1600/Rplots.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://3.bp.blogspot.com/-r4QOnEA3dzI/WHF7305erQI/AAAAAAAAAbA/rCCsLEj9r9cwgfC69p3jhLuVaDtZIkhfQCLcB/s400/Rplots.png" width="400" /></a></div>Implementations used:<br /><br /><pre><code>def imper1(path):
    result = []
    for i in range(1, len(path.split('/'))):
        y = '/'.join(path.split('/')[:i]) or '/'
        result.append(y)
    return result

def imper2(path):
    i = len(path) - 1
    l = []
    while i > 0:
        while i != 0 and path[i] != '/':
            i -= 1
        l.append(path[:i] or '/')
        i -= 1
    return l

def decl1(path):
    return ['/' + '/'.join(path.split('/')[1:l])
            for l in range(len(path.split('/')))[::-1] if l]

def decl2(path):
    return ['/' + '/'.join(path.split('/')[1:-l])
            for l in range(-len(path.split('/')) + 1, 1) if l]

# decl3 hidden. read on ;-)</code></pre><br /><br />It started with imper1 and decl1. I noticed that the imperative version was faster. I tried to speed up the declarative function by replacing [::-1] with some number tricks. It helped, but not to the extent I anticipated. Then I thought about speeding up imper1 by using lower-level constructs. Unsurprisingly, while loops and explicit checks were faster. Let me ignore decl3 for now and play a little with CPython bytecode.<br /><br />Looking at my results, not everything is so obvious: decl{1,2} turned out to have decent performance with the 4-part path, which looks like a reasonable average case.<br /><br />I disassembled decl1 and decl2 to see the difference in bytecode. The diff is shown below (decl1 on the left, decl2 on the right).<br /><br /><pre><code>30 CALL_FUNCTION  1 (1 positional, 0 keyword pair) | 30 CALL_FUNCTION  1 (1 positional, 0 keyword pair)
33 CALL_FUNCTION  1 (1 positional, 0 keyword pair) | 33 CALL_FUNCTION  1 (1 positional, 0 keyword pair)
36 CALL_FUNCTION  1 (1 positional, 0 keyword pair) | 36 UNARY_NEGATIVE
39 LOAD_CONST     0 (None)                         | 37 LOAD_CONST     4 (1)
42 LOAD_CONST     0 (None)                         | 40 BINARY_ADD
45 LOAD_CONST     5 (-1)                           | 41 LOAD_CONST     4 (1)
48 BUILD_SLICE    3                                | 44 CALL_FUNCTION  2 (2 positional, 0 keyword pair)
51 BINARY_SUBSCR                                   |</code></pre><br />As we can see, [::-1] is implemented as three loads and a build-slice operation. I think this could be optimized if we had a special opcode like e.g. BUILD_REV_SLICE.
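Before going further, a quick sanity check that the implementations above agree. A plain-Python sketch (same functions as listed earlier; note that imper1 yields parents in ascending order while the declarative versions yield them in descending order, so the comparison is on sorted lists):

```python
def imper1(path):
    result = []
    for i in range(1, len(path.split('/'))):
        y = '/'.join(path.split('/')[:i]) or '/'
        result.append(y)
    return result

def decl1(path):
    return ['/' + '/'.join(path.split('/')[1:l])
            for l in range(len(path.split('/')))[::-1] if l]

def decl2(path):
    return ['/' + '/'.join(path.split('/')[1:-l])
            for l in range(-len(path.split('/')) + 1, 1) if l]

# The three benchmark paths from the timings above.
for p in ('/lel',
          '/jakie/jest/srednie/zagniezdzenie?',
          '/z/lekka/dlugasna/sciezka/co/by/pierdzielnik/mial/troche/roboty'):
    # every implementation produces the same set of parent paths
    assert sorted(imper1(p)) == sorted(decl1(p)) == sorted(decl2(p))

print(decl1('/a/b/c/d'))  # -> ['/a/b/c', '/a/b', '/a', '/']
```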
My slightly optimized decl2 is faster because one UNARY_NEGATIVE plus one BINARY_ADD is less work than a LOAD_CONST, BUILD_SLICE and BINARY_SUBSCR. The performance gain here is pretty obvious - no matter what, decl2 must be faster.<br /><br />What about decl2 vs imper1?<br />It's more complicated, and it was a little surprising that the visibly shorter bytecode can be slower than its longer counterpart. Here is imper1 disassembled:<br /><br /><pre><code> 3      0 BUILD_LIST               0
        3 STORE_FAST               1 (result)

 4      6 SETUP_LOOP              91 (to 100)
        9 LOAD_GLOBAL              0 (range)
       12 LOAD_CONST               1 (1)
       15 LOAD_GLOBAL              1 (len)
       18 LOAD_FAST                0 (path)
       21 LOAD_ATTR                2 (split)
       24 LOAD_CONST               2 ('/')
       27 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
       30 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
       33 CALL_FUNCTION            2 (2 positional, 0 keyword pair)
       36 GET_ITER
  >>   37 FOR_ITER                59 (to 99)
       40 STORE_FAST               2 (i)

 5     43 LOAD_CONST               2 ('/')
       46 LOAD_ATTR                3 (join)
       49 LOAD_FAST                0 (path)
       52 LOAD_ATTR                2 (split)
       55 LOAD_CONST               2 ('/')
       58 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
       61 LOAD_CONST               0 (None)
       64 LOAD_FAST                2 (i)
       67 BUILD_SLICE              2
       70 BINARY_SUBSCR
       71 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
       74 JUMP_IF_TRUE_OR_POP     80
       77 LOAD_CONST               2 ('/')
  >>   80 STORE_FAST               3 (y)

 6     83 LOAD_FAST                1 (result)
       86 LOAD_ATTR                4 (append)
       89 LOAD_FAST                3 (y)
       92 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
       95 POP_TOP
       96 JUMP_ABSOLUTE           37
  >>   99 POP_BLOCK

 7 >> 100 LOAD_FAST                1 (result)
      103 RETURN_VALUE</code></pre><br /><br />The culprit was a LOAD_CONST in decl{1,2} that loads the list comprehension as a code object.
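That hidden code object can be located directly among the function's constants. A sketch, reusing decl2 from above; note that on Python 3.12+ comprehensions are inlined into the enclosing function, so the separate code object may be absent there:

```python
import dis
import types

def decl2(path):
    return ['/' + '/'.join(path.split('/')[1:-l])
            for l in range(-len(path.split('/')) + 1, 1) if l]

# The compiled comprehension body is stored as a constant of decl2.
comp = next((c for c in decl2.__code__.co_consts
             if isinstance(c, types.CodeType)), None)

if comp is not None:        # absent on 3.12+, where comprehensions are inlined
    print(comp.co_name)     # the comprehension's own code object, '<listcomp>'
    dis.dis(comp)
```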
Let's see how it looks, just for the record.<br /><br /><pre><code>>>> dis.dis(decl2.__code__.co_consts[1])
 21      0 BUILD_LIST               0
         3 LOAD_FAST                0 (.0)
    >>   6 FOR_ITER                51 (to 60)
         9 STORE_FAST               1 (l)
        12 LOAD_FAST                1 (l)
        15 POP_JUMP_IF_FALSE        6
        18 LOAD_CONST               0 ('/')
        21 LOAD_CONST               0 ('/')
        24 LOAD_ATTR                0 (join)
        27 LOAD_DEREF               0 (path)
        30 LOAD_ATTR                1 (split)
        33 LOAD_CONST               0 ('/')
        36 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
        39 LOAD_CONST               1 (1)
        42 LOAD_FAST                1 (l)
        45 UNARY_NEGATIVE
        46 BUILD_SLICE              2
        49 BINARY_SUBSCR
        50 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
        53 BINARY_ADD
        54 LIST_APPEND              2
        57 JUMP_ABSOLUTE            6
    >>  60 RETURN_VALUE</code></pre><br /><br />So this is what list comprehensions look like when compiled to bytecode. Nice! Now the performance results make more sense. In the project I was working on, my function for getting all parent paths was called in one place and contributed perhaps less than 5% of the execution time of the whole application. It would not make sense to optimize this piece of code. But it was a delightful journey into the internals of CPython, wasn't it?<br /><br />Now, let's get back to decl3. What have I done to make my declarative implementation 2x faster on the average case and for the right-side outliers? Well... I just reluctantly resigned from putting everything in one line and saved path.split('/') into a separate variable. That's it.<br /><br />So what are the learnings?<ul><li>the declarative method turned out to be faster than a hand-crafted imperative one employing low-level constructs.<br />Why? Good question!
Maybe because the bytecode generator knows how to produce optimized code when it encounters a list comprehension? But I have written no CPython code, so it's only my speculation.</li><li>trying to put everything in one line can hurt - in the described case the split() function was the major performance drag</li></ul>reddit-related updates:<br />Dunj3 outpaced me ;) - his implementation, which is better both w.r.t. "declarativeness" and performance: <br /><pre><code>list(itertools.accumulate(path.split('/'), curry(os.sep.join)))
</code></pre><br /><span style="color: #999999;"><span style="font-size: xx-small;">syntax highlighting done with https://tohtml.com/python/ </span></span>Logstash + filebeat: Invalid Frame Type, received: 12017-01-03T09:02:00+01:002017-01-03T09:02:00+01:00http://slawomir.net/2017/01/03/logstash-filebeat-invalid-frame-typePost for googlers that stumble on the same issue - it seems that "overconfiguration" is not a great idea for Filebeat and Logstash.<br /><br />I've decided to explicitly set ssl.verification_mode to none in my Filebeat config and then I got the following Filebeat and Logstash errors:<br /><br /><span style="font-family: "Courier New",Courier,monospace;">filebeat_1 | 2017/01/03 07:43:49.136717 single.go:140: ERR Connecting error publishing events (retrying): EOF<br />filebeat_1 | 2017/01/03 07:43:50.152824 single.go:140: ERR Connecting error publishing events (retrying): EOF<br />filebeat_1 | 2017/01/03 07:43:52.157279 single.go:140: ERR Connecting error publishing events (retrying): EOF<br />filebeat_1 | 2017/01/03 07:43:56.173144 single.go:140:
ERR Connecting error publishing events (retrying): EOF <br />filebeat_1 | 2017/01/03 07:44:04.189167 single.go:140: ERR Connecting error publishing events (retrying): EOF</span><br /><br /><br /><span style="font-family: "Courier New",Courier,monospace;">logstash_1 | 07:42:35.714 [Api Webserver] INFO logstash.agent - Successfully started Logstash API endpoint {:port=>9600} <br />logstash_1 | 07:43:49.135 [nioEventLoopGroup-4-1] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 3 <br />logstash_1 | 07:43:49.139 [nioEventLoopGroup-4-1] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 1 <br />logstash_1 | 07:43:50.150 [nioEventLoopGroup-4-2] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 3 <br />logstash_1 | 07:43:50.154 [nioEventLoopGroup-4-2] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 1 <br />logstash_1 | 07:43:52.156 [nioEventLoopGroup-4-3] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 3 <br />logstash_1 | 07:43:52.157 [nioEventLoopGroup-4-3] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 1 <br />logstash_1 | 07:43:56.170 [nioEventLoopGroup-4-4] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 3 <br />logstash_1 | 07:43:56.175 [nioEventLoopGroup-4-4] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 1</span><br /><br 
/>It seems it's better to stay quiet with Filebeat :) Hopefully this helped to resolve your issue.
std::queue's big default footprint in assembly code2016-12-13T23:07:00+01:002016-12-13T23:07:00+01:00http://slawomir.net/2016/12/13/stdqueues-big-default-footprint-inRecently I've been quite busy and now I'm kind of scrounging my way back into the C++ world. A friend of mine told me about the <a href="http://www.includeos.org/" target="_blank">IncludeOS</a> project and I thought that it may be a pretty good exercise to put my hands on the keyboard and help in this wonderful project.<br /><br />To be honest, the learning curve is quite steep (or I'm getting too old to learn so fast) and I'm still distracted by a lot of other things, so no big deliverables so far... but just by watching the discussion on <a href="https://gitter.im/hioa-cs/IncludeOS" target="_blank">Gitter</a> and integrating it with what I know, I spotted a probably obvious but slightly surprising thing about <span style="font-family: "courier new" , "courier" , monospace;">std::queue</span>.<br /><br />std::queue is not a container. Wait, what?, you ask. It's a container adapter. It doesn't have an implementation of its own. Instead, it takes another container, uses it as underlying storage and just provides a convenient interface for the end user. By the way, it isn't the only one.
There are others like <span style="font-family: "courier new" , "courier" , monospace;">std::stack</span> and <span style="font-family: "courier new" , "courier" , monospace;">std::priority_queue</span>, to name a few.<br /><br />One of the dimensions in which C++ shines is its options for customizing stuff. We can customize things like memory allocators. In container adapters we can swap out the underlying container if we decide that the one chosen by the library writers isn't a good match for us.<br /><br />By default, perhaps because std::queue requires fast access at both the beginning and the end, its underlying container is <span style="font-family: "courier new" , "courier" , monospace;">std::deque</span>. <span style="font-family: "courier new" , "courier" , monospace;">std::deque</span> provides O(1) complexity for pushing/popping at both ends. Perfect match, isn't it?<br /><br />Well, yes - if you care about performance at the cost of increased binary size. As it turns out, by simply changing <span style="font-family: "courier new" , "courier" , monospace;">std::deque</span> to <span style="font-family: "courier new" , "courier" , monospace;">std::vector</span>:<br /><br /><b><span style="font-family: "courier new" , "courier" , monospace;">std::queue<<span style="color: #38761d;">int</span>> qd; </span></b><br /><b><span style="font-family: "courier new" , "courier" , monospace;">std::queue<<span style="color: #38761d;">int</span>, std::vector<<span style="color: #38761d;">int</span>>> qv;</span></b><br /><br />The generated assembly for x86-64 clang 3.8 (-O3 -std=c++14) is 502 and 144 lines long, respectively.<br /><br />I know that in most contexts binary size is a secondary consideration, but I still believe it's an interesting fact that the difference is so big. In other words, there must be a lot of things going on under the bonnet of <span style="font-family: "courier new" , "courier" , monospace;">std::deque</span>.
I don't recommend changing deque to vector in production - it can seriously damage your performance.<br /><br />You can play around with the code here: <a href="https://godbolt.org/g/XaLhS7">https://godbolt.org/g/XaLhS7</a> (code based on <a href="https://github.com/Voultapher" target="_blank">Voultapher</a>'s example).