Wikimedia – DcK Area

Get URL from Git commit hash

Dereckson — Sun, 06 Mar 2022 13:47:34 +0000

SHA-1 Git hashes can be mapped to code review or code repository URL to offer a web visualization with additional context.

The resolve-hash command allows to get such URL from a Git hash, or another VCS reference. It can search Phabricator, Gerrit, GitHub and GitLab currently.

Ouf of the box, it will detect your ~/.arcrc configuration and use GitHub public API. You can create a small YAML configuration file it to add Gerrit and GitLab in the mix.

Install it. Use it.

The resolve-hash package is available on PyPI.

$ pip install resolve-hash

$ resolve-hash 6411f75775a7aa8db
https::⃫github.com/10up/simple-local-avatars/commit/6411f75775a7aa8db2ef097d70b12926018402c1

Specific use cases

Projects moved from GitHub to GitLab

GitLab requires any query to the search API to be authenticated. You can generate a personal access token in your user settings; the API scope is enough, so check only read_api.

Then you can add create a $HOME/.config/resolve-hash.conf file with the following content:

# GitLab
gitlab_public_token: glpat-sometoken

For Wikimedia contributors

Gerrit exposes a REST API. To use it, create a $HOME/.config/resolve-hash.conf file with the following content:

# Gerrit REST API
gerrit:
  - https://gerrit.wikimedia.org/r/

Gerrit will be then queried before GitHub:

$ resolve-hash 311d17f289470
https::⃫gerrit.wikimedia.org/r/c/mediawiki/core/+/768149

Note if you’ve configured Arcanist to interact with phabricator.wikimedia.org, your configuration in ~/.arcrc is used BEFORE the Gerrit one. Tell me if you’re in that case, we’ll allow to order resolution strategies.

What inspired this project?

Terminator allows plugins to improve the behavior of the terminal. Some plugins allows to expressions like Bug:1234 to offer a link to the relevant bug tracker.

What if we can detect hashes, especially VCS hashes, to offer a link to the code review system, like Phabricator or Gerrit, or at least to a public code hosting facility like GitHub?

What’s next?

We can add support for private instances of GitHub Enterprise and GitLab. Code I wrote in VCS package is already ready to accept any GitHub or GitLab URL, and is so prepared to accept a specific instance, so it’s a matter of declare new configuration options and add the wrapper code in VcsHashSearch class.

A cache would be useful to speed up the process. Hashes are stable enough for that.

Write a Terminator plugin, so we solve the root problem described above.

The code is extensible enough to search other kind of hashes than commits, but I’m not sure we’ve reliable sources of hashes for know files or packages.

References

Nasqueron DevCentral
Python package repository
- Package on PyPI
- README

One year of contributions to Wikimedia — 2016

Dereckson — Tue, 31 Jan 2017 01:23:00 +0000

Some statistics I’ve computed about my production contributions to Wikimedia:

342 actions logged on the server admin log

573 commits to Wikimedia repos, of which:

248 commits to operations/mediawiki-config

5 new wikis created (hello tcy.wikipedia.org)

Thanks to all the people whom I’ve met or I’ve been engaged with during this year for these contributions.

MediaWiki now accepts out of the box RDFa and Microdata semantic markup

Dereckson — Fri, 18 Mar 2016 19:17:46 +0000

Since MediaWiki 1.16, the software has supported — as an option — RDFa and Microdata HTML semantic attributes.

This commit, integrated to the next release on MediaWiki, 1.27, will embrace more the semantic Web making these attributes always available.

If you wish to use it today, this is already available in our Git repository.

This also simplify slightly the cyclomatic complexity of our parser sanitizer code.

Microdata support will so be available on Wikipedia Thursday, 24 March 2016 and on other projects Thursday, 23 March 2016.

If you already use RDFa today on MediaWiki

First, we would be happy to get feedback, as we’re currently considering an update to RDFa 1.1 and we would like to know who is still in favour to keep RDFa 1.0.

Secondly, there is a small effort of configuration to do: open the source code of your wiki and look the tag.

Copy the content of the version attribute: you should see something like like .

Now, edit InitialiseSettings.php (or your wiki farm configuration) and set the $wgHtml5Version setting. For example here, this would be:
$wgHtml5Version="=HTML+RDFa 1.0";

For the microdata, there is nothing special to do.

December 2014 links

Dereckson — Thu, 01 Jan 2015 07:40:41 +0000

Some links of stuff I appreciated this month. Links to French content are in a separate post. You can also take the time machine to November 2014.

AI

What if instead to understand how the brain works, we copy the neural connections as is? This is what the OpenWorm project tries to do with C. elegans. And, big surprise, that works and allows a bot to move.

Wikipedia

An infographics of the locality of Wikipedia participants shows without any surprise they are mainly from Europe and North America.

If you’re into dumps, the Wikipedia / MediaWiki XML dump grepper will help you to find a particular piece of data, like the text of one article.

Tools

Dev / search. The silver searcher, ag, offers a faster approach than ack to search your code.

Fun / autogenerator. Some years ago, cgMusic offered an implementation on how a computer program could create music. Add some image generation techniques and a word generators, and you can have a fake music generator offering full albums. Ælfgar has stumbled upon Liquified Death by Income Yield.

GIS. Turf is a new open source JavaScript GIS library. This post explains the capabilities and features, including its great offline support.

Electronics

What if an Arduino embeds a web server and allows programmation from the web browser? This is exactly what the Photon by Spark does.

Quartz

An infographics showing satellites orbiting Earth and a point of view of the Uber economy.

Literature

The GoT series offer some comprehensive scenes of torture. Did you ask yourself their interest or need for the plot? Marie Brennan offers a great opinion in « Welcome to the Desert of the Real ».

November 2014 links

Dereckson — Mon, 01 Dec 2014 07:45:49 +0000

Some links of stuff I appreciated this month. Links to French content are in a separate post. You can also take the time machine to October 2014.

November is the Philae landing on the Comet Churyumov-Gerasimenko month and the ESA photo release under CC-BY-SA (one of them here) month. Mainly DevOps links in this post, a Wikidata tool and an algorithm visualisation.

Churyumov-Gerasimenko 67P, 20 November 2014
ESA/Rosetta/NAVCAM, CC-BY-SA 3.0 IGO

Dev

Craft. Jeroen de Dauw has prepared interesting slides about clean functions. Your function should do one task, not be a class disguised in procedural code.

Raft. In a distributed environment, how do you achieve a similar state? Raft is an answer to this question, as a distributed consensus algorithm. To understand how it works, The Secret Lives of Data offers a visual guide.

Wikidata

Wikidata no labels. Harmonia Amanda and Hsarrazin wanted to find items without labels in French, respectively about the Tolkien’s Legendarium or Russians persons to translate. This tool allows you to get some Wikidata items through a WDQ query or to encode them directly, and print a table with the part of these items without label in the specified language.

DevOps

Once upon a time there were a Linux theme park. As a Cobbler / SpaceWalk alternative, we start to see new software to appear: katello/foreman. It’s a part of Katello, the upstream of Satellite 6, and a replacement for SpaceWalk. You want to dive into the Linux theme park? Build images, deploy, manage resources? You’ll be served. Thank you to jnix for these software recommendation.

And now, near the sea. ShipYard allows you to manage Docker instances and containers.

But what is more interesting is the alpha release of OpenShift Origin, the third generation of OpenShift, with a new system design. It relies on Docker and the following technologies:

Kubernetes, an active controller to orchestrate and ensure the desired state of the containers;
An etcd server (which uses the Raft algorithm described above);

With that concepts, you’re ready for the introduction hands-on tutorial available.

The puppetmaster becomes old. Ryan Lane, formerly in Wikimedia ops team, blogged this summer about a Puppet alternative at his new job: Moving away from Puppet: SaltStack or Ansible? For Ryan, 10K+ lines of Puppet codes is now only 1K of SaltStack or Ansible code. The winner of their test to port the Puppet infrastructure into both is SaltStack. It’s a pity, I would have loved to merge yet another fictional universe into the Nasqueron project and add the Ursula K. Guin ansible in the mix.

Sysadmin

FreeBSD 10.1. The first new version of FreeBSD after the SSL bugs is out, and will immediately be deployed on Ysul and Sirius machines as test. Bhyve can use a pure ZFS filesystem and UDP-Lite protocol is finally here.

October 2014 links

Dereckson — Fri, 31 Oct 2014 07:31:02 +0000

Some links of stuff I appreciated this month. Links to French content are in a separate post.

In the servers world

SSL. October is the month we disabled SSLv3 protocol support from nginx following the POODLE attack. So this means we can look to this paper, nginx configuration and a tool to check SSL configuration. The provider Linode has published a comprehensive guide to mitigate the attack.

FreeBSD. FreeBSD 10.1-RELEASE will soon be available. The virtual terminal console driver vt is improved. Oh, and you can now boot bhybe on ZFS. Shell servers will have to deal with the fact login.conf settings will take precedence on .profile and other shell environment for variables like path, blocksize or umask.

Docker. To improve Docker workflow, nitrous.io has released tug, a set of scripts in Go to help common tasks.

Thus shall ye compile in JavaScript

Humble Bundle launches the Humble Mozilla Bundle, games compiled in ASM.js and so playable in the browser.

Meanwhile, in the functionnal language world, a paper shows you can compile OCaml in JS, an it’s sometimes quicker in the JS JIT than it its own JIT (but well… you can also compile OCaml in native, and OCaml JIT isn’t really well optimized).

So if you want to respect this commandment, just compile your C code with clang: emscripten will then happily compile your LLVM bytecode in ASM.js.

Gamergate / NotYourShield

A CNN journalist reads the gamergate as the end of the narration controlled by journalists.

When an Examiner journalist suggests #NotYourShield is 4chan white heterosexual users posing as women and PoC, his tweet is replied with a lot of photos from women and PoC. We so now have a picture of the diversity in video games (permanent link).

On a related theme, I Can Tolerate Anything Except The Outgroup is interesting to read and heavily commented.

Finally, a call for help:

I have a new project, but I need your help. Looking for diverse female voices in STEM that could donate their time & expertise.

— Randi Actually (@freebsdgirl) October 20, 2014

Curiosities

Some scientists push to a new definition of planet, to take in account exoplanets. In such a definition, Pluto would be again a planet. Harvard organized a debate, this position wins.

At Databricks, they carved this pumpkin for halloween:

Carved something scary into pumpkin cc/ @jamesiry @databricks (office jackolantern) pic.twitter.com/sqiNaYGDQy

— Heather Miller (@heathercmiller) October 27, 2014

126 473

Dereckson — Wed, 20 Nov 2013 13:24:52 +0000

126 473 …

… this is the amount of English wikipedia contributors allowed to participate to the 2013 2013 Arbitration Committee Elections.

The eligibility condition checked by this number is the amount of accounts having made at least 150 mainspace edits by 1 November 2013.

The English Wikipedia has now 12 years. This figure so means there is something like ten thousands contributors having made a significant contribution to the main namespace each year, a little less than 1 000 per month, at least one per hour.

If the English Wikipedia would be a country, and those people its population, it would the 192th according this list of countries by population. This is something between Jersey or the United States Virgin Islands, and Guam.

Thank you to Ælfgar for this country comparison idea.

MediaWiki nginx configuration file

Dereckson — Thu, 24 Oct 2013 19:07:08 +0000

Scenario

You have a nginx webserver
You have several MediaWiki installation on this server
You would like to have a simple and clear configuration

Solution

You want a configuration file you can include in every server {} block MediaWiki is available

Implementation

Create a includes subdirectory in your nginx configuration directory (by default, /usr/local/etc/nginx or /etc/nginx).
This directory can welcome every configuration block you don’t want to repeat in each server block.
You put in this directory mediawiki-root.conf, mediawiki-wiki.conf or your own configuration block.
In each server block, you can now add the following line:
```
Include includes/mediawiki-root.conf;
```

Configuration I – MediaWiki in the root web directory, /article path

This is mediawiki-root.conf on my server:

        # Common settings for a wiki powered by MediaWiki with the following configuration:
        #   (1) MediaWiki is installed in $root folder
        #   (2) Article path is /<title>
        #   (3) LocalSettings.php contains $wgArticlePath = "/$1"; $wgUsePathInfo = true;

        location / {
            try_files $uri $uri/ /index.php?$query_string;
        }

        location ~ ^/images/thumb/(archive/)?[0-9a-f]/[0-9a-f][0-9a-f]/([^/]+)/([0-9]+)px-.*$ {
            #Note: this doesn't work with InstantCommons.
            if (!-f $request_filename) {
                rewrite ^/images/thumb/[0-9a-f]/[0-9a-f][0-9a-f]/([^/]+)/([0-9]+)px-.*$ /thumb.php?f=$1&width=$2;
                rewrite ^/images/thumb/archive/[0-9a-f]/[0-9a-f][0-9a-f]/([^/]+)/([0-9]+)px-.*$ /thumb.php?f=$1&width=$2&archived=1;
            }
        }

        location /images/deleted    { deny all; }
        location /cache             { deny all; }
        location /languages         { deny all; }
        location /maintenance       { deny all; }
        location /serialized        { deny all; }
        location ~ /.(svn|git)(/|$) { deny all; }
        location ~ /.ht             { deny all; }
        location /mw-config         { deny all; }

Configuration II – MediaWiki in the /w directory, /wiki/article path

This is mediawiki-wiki.conf on my server:

        # Common settings for a wiki powered by MediaWiki with the following configuration:
        #   (1) MediaWiki is installed in $root/w folder
        #   (2) Article path is /wiki/<title>
        #   (3) LocalSettings.php contains $wgArticlePath = "/wiki/$1"; $wgUsePathInfo = true;

        location /wiki {
            try_files $uri $uri/ /w/index.php?$query_string;
        }

        location ~ ^/w/images/thumb/(archive/)?[0-9a-f]/[0-9a-f][0-9a-f]/([^/]+)/([0-9]+)px-.*$ {
            #Note: this doesn't work with InstantCommons.
            if (!-f $request_filename) {
                rewrite ^/w/images/thumb/[0-9a-f]/[0-9a-f][0-9a-f]/([^/]+)/([0-9]+)px-.*$ /w/thumb.php?f=$1&width=$2;
                rewrite ^/w/images/thumb/archive/[0-9a-f]/[0-9a-f][0-9a-f]/([^/]+)/([0-9]+)px-.*$ /w/thumb.php?f=$1&width=$2&archived=1;
            }
        }

        location /w/images/deleted  { deny all; }
        location /w/cache           { deny all; }
        location /w/languages       { deny all; }
        location /w/maintenance     { deny all; }
        location /w/serialized      { deny all; }
        location ~ /.(svn|git)(/|$) { deny all; }
        location ~ /.ht             { deny all; }
        location /w/mw-config       { deny all; }

Example of use

www.wolfplex.org serves other application is subdirectories and MediaWiki for /wiki URLs.

This server block:

is a regular one
includes our includes/mediawiki-wiki.conf configuration file (scenario II)
contains a regular php-fpm block
contains other instructions

    server {
        listen          80;
        server_name     www.wolfplex.org

        access_log      /var/log/www/wolfplex.org/www-access.log main;
        error_log       /var/log/www/wolfplex.org/www-error.log;
        root            /var/wwwroot/wolfplex.org/www;
        index           index.html index.php index.htm;

        [...]

        include         includes/mediawiki-wiki.conf;

        location / {
            #Link to the most relevant page to present the project
            rewrite /presentation/?$ /w/index.php?title=Presentation last;

            #Link to the most relevant page for bulletin/news information:
            rewrite /b/?$ /w/index.php?title=Bulletin:Main last;

            [...]
        }

        [...]

        location ~ \.php$ {
            try_files $uri =404;
            fastcgi_pass   127.0.0.1:9010;
            fastcgi_param  SCRIPT_FILENAME  $document_root$fastcgi_script_name;
            include        fastcgi_params;
        }
    }

Some notes

Configuration is based on Daniel Friesen’s MediaWiki Short URL Builder, who collected various working nginx ones. There are some differences in the rewrite, our goal here is to have a generic configuration totally agnostic of the way .php files are handled.
Our configuration (not the one generated by the builder) uses a if for the thumbnails handler. The nginx culture is a culture where you should try something else than an if. See this nginx wiki page and this post about the location if way of work for more information.

8 bit music in pure JavaScript

Dereckson — Sat, 24 Aug 2013 14:52:08 +0000

HTML 5 offers a Web Audio API, to add synthesizing audio support in web applications.

Cody Lundquist, an Australian from Sidney, created a 8 bit music audio library built on the top of the Web Audio API, called 8Bit.js Audio Library.

You define a time (e.g. 4/4), a tempo, and you then the notes.

Submitted 2 days on Reddit, the library got a favorable reception, with some people adapting themes. There is even an original composition, rather nice, called Cities.

A LilyPond support is planned, so in the future there could be a possibility to implement this library into the MediaWiki score extension.

Not yet for every browser

Safari 6 supports it, so only on iOS and Mac, not yet on Windows.
Chrome 10+ supports it, and so Opera 15,
Firefox, Internet Explorer and Opera 12 don’t support it.
[2020 edit: Firefox has added support for it in versions 25 (desktop) and 26 (mobile)]

Links

Listen to Tetris with 8Bit.js
8-bit music on Wikipedia
Web Audio API specifications
AngularJS, the framework used by 8Bit.js
Apple documentation about the Web Audio API
Coming soon support on Firefox

Acknowledgment

Thanks to Linedwell for the help during browsers test.

Gerrit activity feeds :: a design and infrastructure sneak peak

Dereckson — Sun, 20 Jan 2013 11:11:30 +0000

Gerrit provides nice views by changes, but doesn’t offer synthetic and consolidated views.

Activity feeds will be timelines to offer these views;

What are the users’ last activities (commits, patchsets, merges) on Gerrit?
What’s going on on the mediawiki/extensions/SemanticMediaWiki repository?

Here the homepage dashboard:

And here the wireframe of the project activity feed:

About the design

This code is built on the top of Foundation, a responsive CSS framework. This allows to provide a smooth experience for your phone or tablet: columns will collapse into a more linear view if resolution width is narrow.

Avatars uses Gravatar. When an user doesn’t have a Gravatar account, identicons are used.

About the infrastructure and code

A Node service acts as proxy, and mirrors the Gerrit events stream, so it’s available to any simple TCP connexion instead to require a SSH connection.

I’ll provide access to this Node server to the community, so any tool with socket and JSON support with be able to interact with Gerrit events. If you’ve a need for a push model, ie to post notifications, please let me know the format and I will take care of that.

Then, a script reads the stream and write the XML feeds. It also monitors the Node -> SSH connection, to relaunch the service if needed (e.g. if the Jenkins server is rebooted). These XML feeds are publicly accessible, so you can also create a service based on them.

Finally, XSLT will be used to render these feeds in HTML and RSS documents. That’s for the humans and the most generic tools..

It will be at this moment time to take care of special needs, like combined feeds for Google Summer of Code or the Outreach Program for Women.

What do you think of this and do you need?

Please tell me what you think of this tool, and what you would like to find on this tool.

We also need a cool name for the application.

Building a wifi based IoT device for home automation

loic.quertenmont@gmail.com (Loic Quertenmont) — Wed, 11 Apr 2018 11:17:28 +0000

In this blog post, we will build an Internet Of Thing (IoT) device based on the super cheap ESP8266 chip. The device is used to automate home shutters at a predefined time of the day or according to the house temperature in order to limit the temperature increase caused by sunlight. The ESP8266 chip is a microcontroller with WiFi capabilities and a full TCP/IP stack. We can, therefore, have a complete web server and REST API running in it so as an MQTT client that can be used to send commands to the chip from anywhere and also to receive data from the chip. The project was split into phases:

1) Desired capabilities:

The requirements for this device are:

Cheap: We have ~10 shutters to control so it has to be cheap and for sure cheaper than equivalent commercial devices.
Grid-Powered: It has to be connected to the power plug as we don't want to change the battery of every single device yearly.
Small: We want it to be located inside the wall of the house in the cavity behind the manual shutter switch (that we are replacing by this device). The footprint should be around 50x40mm with a depth of less than 25mm.
Connected: It should onboard a webserver, a WiFi antena, and be capable of acting simultaniously as a wifi station or as an access point.
Over The Air (OTA) Programming: We should be able to update the software of the device remotely. The device will be hardly accessible once placed in the wall.
Control: The device should be able to roll-down and to roll-up the shutter. It should therefore be able to open/close 220V highly inductive circuits.
Easy: It should be easy to build and to program, and it has to be well docummented and widely supported by the community.

2) Designing:

The minimal pieces that we need to put together in order to meet these requirements are:

Connected, Cheap, Easy, Small, and OTA programming: The ESP8266 chip is the ideal candidate for this, it just cost a few dollars, and there are tons of tutorial and documentation all over the web. There are many module version integrating this chip, we opted for the ESP-12E version whih has good WiFi range performance and quite enough GPIO to control our shutters and other stuff like temperature sensors. The ESP8266 module requires a 3.3V power supply delivering 500mA (necessary during WiFi communication).
Conrol and Small: In order to control the shutters we basically have two options, either we use an electrical relay which is cheap but big, noisy, slow and with a limited lifetime or we use (opto-isolated) triacs that are expensive but have the advantage to be small, silent and robust. We chose to use the L4008D6 triacs which can handle 8 Amps in 400VAC and are snubberless which is quite necessary given the high inductance of the shutter motors. The two triacs are controlled by two MOC3081M optocouplers which also add an insulation between the high voltage (220V AC) circuit and the low voltage (3.3V DC) circuit.
Grid-Powered, Cheap and Small: We need a small footprint power supply that can delive the 3.3VDC required by the ESP8266 from the 220VAC wall plug. Usual passive transformers are excluded has those beast are rather big and expensive. So the only real alternative here are the "made in china" switching power supply that are quite often used in mobile phone chargers. Note that these devices should be used with a lot of care as they handle pretty high voltages (>KV) and load some large capacitors... this is quite a deadly mix so it should really be manipulated with caution.

In order to minimize the size of the entire device, we have no other choice than smartly designing the Printed Circuit Board (PCB) of the device to combine all these elements on a minimal space. Note: our final design uses a double sided PCB which have the switching power supply located behind the ESP8266 wifi antena. This is far from idea has the switching power supply could cause interferences that negatively affect the wifi antena. We will therefore need to make an initial prototype to make sure that this location is acceptable. We have also added a small temperature sensor on the board to monitor the Triacs temperature in the testing phase of the board. Note that the proximity of the temperature sensor to the power supply makes it totally unusable to measure the house temperature (there is an obvious bias of 4-5°C).

See below the schema for this device:

See below the board design for this device:

We can now compute what would be the cost of the device based on the component list and price for small quantities:

Component	Unit Price ($)	Quantity	Total Price ($)
ESP8266_ESP12E	2.40	1	2.40
Power-Supply 110V/220V to 3.3V 700mA AC-DC	1.47	1	1.47
OptoCoupler MOC3081	0.88	2	1.76
Triac L4008D6 or T405-600B	0.90	2	1.80
Resistance RC1206JR-07330RL	0.01	4	0.04
Resistance OM16G5E-R58	0.06	2	0.12
Temperature Sensor:	0.51	1	0.51
Connector ATB612-508-2P	0.08	1	0.08
Connector ATB612-508-3P	0.08	1	0.08
pin header 1x8	0.02	1	0.02
double sided PCB (from firstPCB)	1.84	1	1.84
TOTAL			10.12

So basically, we are at the level of 10$ per device. The cheapest commercial device I found on the web was around 50$, and they were not as complete as these. Note that there is still some soldering to be done to put all these together... so we are not completely comparing apple to apple. Note however that manual shutter switches already cost more than 10$.

3) From prototyping to production:

Before ordering all the components and making a PCB order in china (which usualy take a month), let's make a prototyping board on a home-made PCB. The manual shutter switch (first figure) was removed and replaced by the prototype that was inserted into the wall (seecond figure). Note that I am not able to make double sided PCB at home which explains why they are some wires here and there on the prototype. The prototype stayied in place for an entire week which also allowed me to make progress on the software (see next chapter). As I haven't observed any issue regarding WiFi connectivity during the testing period I move forward with the production of 10 devices (last figure).

4) Software Development:

One of the major advantage of the ESP8266 chip is that it is compatible with the Arduino software stack that is widely adopted by the hacker community world wide. There are therefore already many librariries available to do practically anything. See the ESP8266 Arduino github for a complete documentation. So here the software part is mostly reduced to just combinning all the libraries together. I won't go in to the details of my code, but instead, I will list the functionnality that I have implemented in the software.

The device is visible as a WiFi (password protected) access point to which we can connect to define login/password of the home wifi network (so the chip can connect itself as a station to the home wifi). We can also use this mode to rename the device and set the MQTT Broker address, port and credentials.
The device connects as a station to the home wifi and stay connected. The device also runs a SSDP and mDNS servers which allows the network to discover the device and get it's hostname and other details => that typically allows windows to properly list the device in the network center and provide easy access to the device web interface from windows.
The device run a webserver with a list of endpoints to
- Retrieve onboard temperature
- Roll-up / Roll-Down / Stop / Tilt the shutter controlled by the device
- Retrieve device uptime
- Retrieve hardware and software details (software version, MAC address, IP address, device name, connected clients to the device access point, etc.)
- Interact with an external I2C device (that can be connected to the device thanks to connector JP1)
- Set MQTT Broker address, port and credentials
- Update a new firmware via Over The Air programming
- Reboot the device
- ...
The device run a MQTT client which basically give the same functionnality than the webserver. The client listen on the topic <DEVICE_NAME>/in and publishes to the topic <DEVICE_NAME>/out/log, where basically every action are pushed so we can easilly monitor a fleet of devices from our favorite message broker (i.e. mosquitto). We can also imagine adding other publishing topics for instance to push temperatue measurements every 5 minutes.
In case of WiFi disconnection, the device wait a bit and then try to reconnect.

5) Central Server:

Finally, we need one more piece to orchestrate all the devices at once from a nice user interface. This last piece is a small server that can host the MQTT broker and server as a bridge between the user and the devices. For this, we have used an old raspberry pi on which we have installed the mosquitto MQTT broker and on which we have also deployed a small django website from which we can roll the shutters up and down individuall or all at once. In addition, the django server also run a cron scheduler which automatically roll the shutters up in the morning and down in the evening. More complex logic can also be implemented in there to take actions according to special conditions liek temperature, vacation, etc. This sever is also used as an insulation layer between the wide internet and the secured local network. See the figure bellow to see how this small user interface looks like:

Conclusion:

We have conducted an entire IoT project from A to Z satisfying a specific list of constrains. We've designed the electronic shema for the device, built a prototype, made a small production of final devices, develop their software, setup a server to communicate with these devices using either REST API or MQTT and finally, we have built a simple but handy user interface to orchestrate all these devices.

Do you have needs for something similar ? Feel free to contact us, we'd love talking to you…

If you enjoyed reading this post, please like it. It doesn't cost you anything, but matters for me!

Reverse Image Search in Aerial and Satellite pictures

loic.quertenmont@gmail.com (Loic Quertenmont) — Fri, 19 May 2017 05:30:49 +0000

In this blog post, we will see how we can use reverse image search based on (unsupervised) convolutional neural networks to make the analysis of satellite/aerial pictures both more efficient and simpler. After reading this post, you will be able to find similar objects in a large aerial/satellite images and from there develop your own GIS statistical applications (i.e. to count all white cars in your neighborhood, identify specific road markings or kind of trees, etc. ).

Introduction

If you haven't done it yet, we suggest you start this blog by reading our previous blog post introducing the concepts behind reverse image search algorithm. We are reusing here all the concepts and technologies that we have previously introduced there. In particular, we are going to reuse the same topless VGG16 algorithm. A large aerial picture of a west coast neighborhood is used as a benchmark for today's blog. The benchmark picture was taken from the Draper satellite image chronology kaggle competition.

The main differences with respect to the introduction blog post are listed bellow.

Large pictures

Satellite or aerial pictures are generally quite large compared to the size of the objects of interest in the pictures. For instance, the dimension of our benchmark picture is 3100x2329 pixels while the typical size of a car on the picture is about 40x20 pixels. In comparison, in the introductory course, we had 104 pictures of at most 166x116 pixels.

The best way to handle this difference is to divide the picture into same-size tiles. We compute the picture DNA of each tile using the topless VGG16 algorithm introduced earlier and later on, we will build up a similarity matrix by computing the cosine similarity between all pairs of tiles DNA. The size of the tiles should be slightly larger than the size of the objects we are interested in. In this blog, we are interested in objects with a scale of the order of a meter and we've therefore decided to use tiles of 56x56 pixels. The figure bellow shows the benchmark picture with a grid of 56x56 cells.

ConvNets are performing better at identifying object features when those are well centered on the image. Therefore, we need to guarantee that objects of interest are never shared between two tiles. We don't want the tile separation grid to cut the objects in two. The best way to guarantee that is to duplicate all tiles with an offset of half the size of a tile. Of course, we need to apply an offset either horizontally, vertically or both. The size of the offset is typically half of the tile size, but it can also be smaller. In today's benchmark, we are using an offset of 14 pixels (a quarter of the tile size). With such an offsetting, the number of tiles is multiplied by 16. This guarantees that the object of interest is always centered on at least one tile and thus to have the ConvNet extracting properly the object features.

A note about computing time: The total number of tiles is inversely proportional to the square of the tile size. The computing time to compute the tile DNA scales linearly with the number of tiles, but the similarity matrix computation scales quadratically with the number of tiles. So you should perhaps think twice before using tiny tiles.

Pictures from the sky

By definition, satellite or aerial pictures are taken from the sky.... this might appear as a negligible detail for what we are trying to achieve, but it is not. As previously explained, the VGG16 algorithm was trained using the ImageNet dataset that is made of "Picasa"-like pictures. These pictures have a vertical orientation. This means that cars always have their wheels at the bottom of the pictures. Similarly, characters and animals have their head above their legs in most of the pictures. The algorithm has learned that the pictures aren't invariant against rotation or top/down flipping. In other words, the meaning of a car picture with the wheels at the top is very different than the same picture with the wheels pointing down the road.

On the contrary, pictures from the sky have an invariant meaning against either flip symmetry or rotation. In aerial pictures, the orientation of the object is totally random and has no particular meaning. A picture of a car driving toward the west or the east is still the picture of a driving car.

Since the VGG16 algorithm has learned to account for the orientation of the image, we will need some extra steps to make our similar picture finder insensitive to the orientation. We have two options:

Option1: we retrain the VGG16 algorithm to learn that picture A and picture A flipped or picture A rotated must have the same DNA vector. This is a fully unsupervised learning, as we don't need a set of labeled pictures to perform this training. we can use randomly chosen tiles from the main picture to perform this training. However, we would still need a relatively large amount of time to perform this retraining. In addition, there is some risk that the performance of the VGG16 algorithm at identifying picture features get reduced by this loss of picture orientation. This option might be worth doing if you consider a larger project with a gigantic amount of tiles to process.

Option2: instead of computing just one DNA vector per picture, we can compute one DNA vector per picture and per symmetry transformation. Then, we'll try to identify similar images, we can compare the reference picture DNA vector to the vectors of all other pictures and their symmetries. This approach is a bit heavier in terms of CPU when computing the similarity matrix, but the implementation is straight forward. Note that the VGG16 is expected to be already insensitive to left/right symmetry, so the only symmetry that we should consider are actually rotations. The top/down symmetry is not needed either as it could be decomposed as an 180-degree rotation followed by a left/right symmetry.

For today's benchmark, we opted for the option2 and the symmetry that we considered are simply the four 90-degree rotations. This virtually increases the number of tiles by yet a factor 4. In total, we have considered 143664 tiles+symmetries in this benchmark demo. This leads to the computation of about 10 billions cosine similarities.

Full Demo

For this benchmark, we have created a full application demo with:

a D3js frontend: which allows having a nice interface on top of the algorithm. The D3js interface that we built allows zooming on a specific area of the picture (using mouse wheel), to move within the picture by dragging it and to select an area of interest by clicking on the picture. Once an area of interest is selected, the 10 most similar area are highlighted. Moving the mouse over those area shows the value of the cosine similarity.
a Django backend: where the hard code processing is being executed. The Django backend allows us to execute our python code on the fly and ease the access to the database storing precomputed cosine similarity for all the tile+symmetry pairs.
a database with an index: which allows returning the most similar area associated to a specific tile in no time.
a result filtering module: Because we have several tiles covering the same area (or a part of the area) due to the offsetting strategy we choose, we have good chances that the algorithm find that the most similar areas are the reference area shifted by a small offset (a quarter of the tile size in our benchmark example). Although this is indeed a valid result, it is for sure not the interesting results we are naively expecting. So we have added a cross-cleaning module that rejects tiles from the result list that are either partially covering the reference area or a tile with a better similarity score that is already in the result list

The full demo is visible at the bottom of this blog post but is also available on this page.

Performance on specific examples

In this section, we are discussing the performance of the algorithm at finding similar objects for some specific examples. You can repeat these tests by yourself in the demo app. In the series of picture bellow, the top-left image shows the reference area (the one for which we are trying to find similar matches) in blue. The rest of the image is just there to show the reference tile in its context. The 9 other pictures show the closest matches in their context. For tiles that are close to the main picture borders, the missing context (outside the picture) is shown as a black area, For the matches, the similarity score is also given for reference. In both the reference and matches pictures, the tile coordinates are also given, so you can try to locate the tile in the main picture (or in the demo app).

Road marking:

The reference object is a usual road marking symbol. In the 9 similar tiles proposed by the algorithm, we see that 6 of them have the same road marking. The 3 others are clearly false positive, but if we look closer at the reference picture, we can notice that it captures a car side on the left and right of the images as well as large area of road. The 3 false positive pictures don't have the road marking, but still, have these 3 other features, so it could also be considered to be a similar image. We could avoid this false match, by adapting the reference area to the exact size of the road marking if we need to develop such a marking finder application.

Dark cars:

The reference object is a rather dark vehicle parked along the sidewalk. In the 9 similar tiles proposed by the algorithm, we see that 4 of them have the same type of vehicle in the same context. The others are clearly false positive, generally showing roof parts. Here the algorithm clearly focuses on the color pattern of the reference image where the dark gray area of the road is close to a lighter area from the sidewalk. Interestingly, a similar type of patterns caused by light/shadow effects is also present in some house roof pictures.

White cars:

This time, the reference object is a white vehicle parked along the sidewalk. In the 9 similar tiles proposed by the algorithm, we see that 6 of them are really perfect matches. Another one picked up a colored car instead of a white one. And the 2 others are just showing a reference picture with 50% of roads and 50% of sidewalks. In these cases, the algorithm focused on features of the background rather than on the vehicle. We can also notice that in these images, there is a rectangular shape on the sidewalk which may be interpreted as a vehicle by the algorithm.

White cars in front of a house:

We can play the same game with a white car that is parked in front of a house this time. This time we have a perfect score. All pictures are indeed showing similar pictures.

Road, sidewalk, and grass:

Let's try a picture made of three "background" parts: road, sidewalk and some grass on the corner. In the absence of a "main" object, the algorithm may focus on unexpected picture features, so it's an interesting example. We would say that the algorithm is providing meaningful pictures in 8 out of 9 of the cases. On these pictures, there is indeed always some part of roads, sidewalks, and grass. The proportion of each component may differ significantly, but they are always present. Sometimes we have an extra object in addition of these components (eg. a car or a bush), but that's OK. We also count a false positive again caused by light/shadow effect on a roof.

Road corner:

Another complex example is this reference picture showing a road corner (with a curved sidewalk and some grass). The four firsts pictures are quite positive matches? The first one, in particular, is almost identical to the reference picture. We can also note that the algorithm also catches road turning in a quite different way. The other 5 pictures are totally wrong which is consistent with their relatively low similarity score. A quick look at the global picture easily gives us the explanation: there is actually only very few part of the images showing some road corners, so the algorithm as a hard time finding a similar area. So it gives what he can find that is showing similar features, We can, for instance, notice the circular swimming pool that has almost the same bend radius than the road corner.

Solar panels:

Finding solar panels on house roof from aerial pictures has a lot of application in marketing, statistics, and energy forecast. So it is interesting to see how well the algorithm is performing at this simple task. We got 3/9 matches which are indeed showing solar panels. Three pictures are showing roof with some objects on it. And the last three are complete false positive. But looking again at the global picture, we can notice that the number of roofs with solar panels are actually quite limited in this neighborhood, so again, the algorithm has a hard time finding matches and does what he can. Moreover, the size of tile is not necessarily appropriate at finding objects of that size.

Palm trees:

Finding specific species of trees also have many sorts of daily life applications. Is this algorithm capable of making the difference between a palm tree and an oak? Let's try.
We picked up a reference picture clearly showing a palm tree next to a house. The matches show 6 pictures showing palm trees while the three other matches are showing other sorts of trees. What is interesting is that in some cases, the algorithm has a better match on the shadow of the tree rather than on the three itself. This is quite unexpected, but looking closer at those matching picture, we would say that this is also true for the human eye.

Full Demo:

You can find bellow the full demo that we have built up using the topless VGG16 algorithm wrapped in a Django backend and a D3js frontend.

Click on an area of interest in the satellite image below. The Deeper Solution algorithm for reverse image search will find the 10 tiles in the region that look the most similar to the place you selected. Move the mouse over a picture match (blue square) to see its similarity score compared to the reference area (red square). You can zoom on the picture using the mouse wheel and pan/move the picture by holding the mouse button down while moving the mouse. Open the demo in a separated window.

Have fun!

Conclusion

In conclusion, we have seen that convolutional neural networks are quite helpful at finding similar areas in either aerial or satellite pictures. We have demonstrated that encoded small pictures area into small feature-sensitive DNA vector make the picture finder both efficient and fast. We were able to find very specific objects like palm trees, solar panels or specific types of cars, without even training the algorithm at recognizing those objects. A small benchmark application was developed to demonstrate the easiness of deploying such a technique for business solutions. Moreover, the approach and the algorithm can easily be scaled to extremely large pictures (or picture collections) covering large cities or even countries using distributed computing (through Apache Spark for instance).

Have you already faced similar type of issues ? Feel free to contact us, we'd love talking to you…

If you enjoyed reading this post, please like it. It doesn't cost you anything, but matters for me!

Reverse image search based on (unsupervised) convolutional neural network

loic.quertenmont@gmail.com (Loic Quertenmont) — Fri, 12 May 2017 17:00:11 +0000

The requirement for a (very) large training set is generally the main criticism that is formulated against deeplearning algorithms. In this blog, we show, how deep convolutional neural networks (CNN) can be used in an unsupervised manner to perform efficient reverse image search.

The requirement for a (very) large training set is generally the main critic that is formulated against deeplearning algorithms. In this blog, we show how deep convolutional neural networks (CNN) can be used in an unsupervised manner to identify similar images in a set. At first, we remind the general concepts behind convolutional neural networks. In a second step, we discuss the VGG16 algorithm and how it was trained. We then explain the concepts of transfer learning and retraining. After that, we have all the ingredients to introduce the reverse image search algorithm. Finally, we will show a demo of the reverse image search performance with a deck of cards.

Convolutional neural network

Convolutional neural network (CNN or ConvNet) is a type of artificial neural network that was initially developed by Yann LeCun (and friends) for image recognition. As for the regular neural network, they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a linear operation (input * weights + bias) and optionally follows it with a non-linear activation function (generally a Sigmoid or a Relu). The major difference with respect to a regular neural network is that ConvNets are translation invariant. This an important property for object recognition in images because we want the neural network to identify the object regardless on its position on the image. Intuitively, one way to achieve this requirement would be to divide the picture into many smaller-size pads and apply a unique (reduced) neural network on each of these pads. Therefore, if the object that we are trying to identify is present on the bottom-left corner of the image instead of the top-right corner, we would still be capable of finding it as soon as we have a picture pad covering that region. We can actually go further and split the pads into smaller pads made up of the constituent of the objects we are trying to identify. Etc. Etc. This is the general idea behind convolutional neural networks.

ConvNets are generally made of two alternating types of layers:

Convolutional Layers
This is the layer that does most of the computational work. It is made of a certain number of learnable kernels/filters that are have reduced width/height extension compared to the size of the image but are sensitive to the full image depth (used to encode the color of the image pixels). During the forward pass of the network, the filters are locally convoluted with all possible image pads of the filter dimension. The local (convolution) output of the layer is then passed through an activation function (generally a Relu). When the learnable pattern of the filter matches the local pattern of the image, the output of this process is generally close to 1 otherwise it is close to 0. The convolutional layer output is, therefore, a local indicator of pattern identification on the picture.
The filter patterns are learned using regular neural network backpropagation and gradient descent technique.

Pooling Layers
The main goal of the pooling layer is to reduce the spatial size of the image representation (after a convolutional layer) in order to reduce the number of learnable parameters in subsequent layers. This layer is basically a non-linear downsampling of the image representation. Very often, the spatial dimension is reduced by a factor n=2 or 3 by keeping only the value of the largest pixel in a group of n x n pixels.

A ConvNet is generally made up of several pairs of convolutional + pooling layers and ended with one or two fully connected layers as we can find them in a multi-layer perceptron. The first convolutional layers generally learn to identify simple (very local) patterns like vertical lines, horizontal lines, diagonals, etc. While the following layers are made up of patterns of these simple patterns are good at identifying more complex patterns. For instance, could learn that a car is made up of "wheel" and "windshield" patterns. The purpose of the final layers is to make a prediction based on the presence of pattern at specific locations in the image representation. The figure bellow, taken from this excellent blog, shows the typical architecture of a ConvNet.

We stop here the ConvNet introduction as we don't need a deep theory understanding for the following of this blog. If you'd like to know more about CNN, the web is full of pedagogical introductive courses to ConvNets (see for instance this, this or this).

ImageNet and the VGG16 network

The total number of free parameters in a convNet is significantly smaller than for a multi-layer perceptron achieving the same job. Nonetheless, the number of parameters to learn in a modern convNet is generally still pretty large (order of billions) and require a large training set to avoid overfitting.

The purpose of the ImageNet database is to provide such training datasets in order to accelerate the research in image recognition. Thanks to imageNet, we have access to 15M pictures that are labeled by one or more category. They are currently about 20K different category label (or synsets). In addition, the ImageNet group organizes yearly image recognition challenges where the word class researchers can compare the performances of their algorithms. For this blog, we are in particular interested in the Task 2a of the 2014 contest, The goal was to classify 50K pictures among a 1000 different label categories. Identified object are also asked to be localized on the picture. A training set of 1.2M images was provided.

One of the main winners of this competition is the VGG16 algorithm that was developed by the famous Visual Geometry Group (VGG) of the University of Oxford. The algorithm is documented in great details in this scientific article. They are also many pedagogical introductions to the algorithm all over the web, so I am just going to give a brief summary here. The VGG16 algorithm is made up of 5 convolutional blocks followed by 3 fully-connected dense layers. The inputs are images of size 224 pixels (width) x 224 pixels (height) x 3 color components (depth) and the output is a vector of 1000 components giving the probability that the input image belong to each of the possible label categories. The network is made of 16 "active" layers that are listed hereafter. The number in parenthesis provides the tensor dimensions of each layer outputs:

Input Layer (224x224x3)
CNN Block 1
- Convolutional Layer1 (224x224x64)
- Convolutional Layer2 (224x224x64)
- Pooling Layer (112x112x64)
CNN Block 2
- Convolutional Layer1 (112x112x128)
- Convolutional Layer2 (112x112x128)
- Pooling Layer (56x56x128)
CNN Block 3
- Convolutional Layer1 (56x56x256)
- Convolutional Layer2 (56x56x256)
- Convolutional Layer3 (56x56x256)
- Pooling Layer (28x28x256)
CNN Block 4
- Convolutional Layer1 (28x28x512)
- Convolutional Layer2 (28x28x512)
- Convolutional Layer3 (28x28x512)
- Pooling Layer (14x14x512)
CNN Block 5
- Convolutional Layer1 (14x14x512)
- Convolutional Layer2 (14x14x512)
- Convolutional Layer3 (14x14x512)
- Pooling Layer (7x7x512)
Flatten Layer (25088 = 7x7x512)
Dense Layer (4096)
Dense Layer (4096)
Dense (Softmax) Layer (1000)

The accuracy of the algorithm at identifying 1000 different objects on images is about 93%. This is a very high accuracy when we know that the accuracy of humans for the same task is about 95% as explain in this paper. Today, they are more sophisticated algorithms that can even outperform human performances for such tasks.

Now that we know the VGG16 network configuration and that we have a large publicly available training set, we are ready to train the algorithm... We only need to use the 1.5million training set to learn 139 million learnable parameters through backpropagation... Hum... that's going to really take a while on a personal computer. Computing resources are actually the bottleneck here....

Fortunately, pre-trained weights for the VGG16 algorithm (and many others) are publicly available on the Caffe Model Zoo web page. The python deeplearning library Keras even have automatic scripts to download the weights when needed. So we have in our hands one of the most accurate algorithms for image recognition that is already pre-trained and ready to use.

Transfer Learning

What is interesting with convolutional neural networks, is that each of the convolutional layers is learning specific image features (lines, colors, patterns, groups of pixels, etc.). It's only at the very end of the network that all these image features are connected to each other in view of tagging the entire image. So, although the end of the network is highly specialized at recognizing the thousand possible labels of the 2014 ImageNet contest, the lower parts of the network are more generic and could be recycled for other tasks without retraining. This is often referred to as "transfer learning".

Thanks to transfer learning, we could replace the last layer(s) of the network by something that is more dedicated to our use case (different than the 1000 ImageNet labels). For instance, if we are interested in finding whether our image contains either a Cat or a Dog, we would have only two different labels and our last layer could be made of only two neurons (compared to the 1000 of the VGG16). If we assume that the rest of the network should remain unmodified, we would have to learn only 8194 free parameters (corresponding to 2 bias value and all weight connections between the 4096 neurons of the one to the last layer with the 2 neurons of the last layer). This is a very small number of parameters to learn for such a complex task. We would, therefore, need a relatively small training dataset to train our specialized VGG16 algorithm.

Depending on the problem, the hypothesis the first VGG16 layers remains totally unchanged might not be always good. In these cases, we may want to entirely re-train the algorithm and therefore find the optimal values for all of the 139M parameters of the network. But even in that situation, the transfer learning is a useful thing as we could start with parameter values that are already close to their minimum. This would make the algorithm to converge much much faster. Training time could easily be reduced from weeks to hours timescale.

Unsupervised usage of the VGG16 algorithm for Reverse Image Search

So far, we have explained how we can train the VGG16 using super large training dataset, and how to "recycle" pre-trained model in order to accommodate a smaller training set to perform simpler tasks. But, we can go further and recycle the VGG16 pre-trained model to perform a very useful task that would not even need a small retraining. This would, therefore, move this to the category of "unsupervised" model that is immediately ready to use.

If we drop the very last layer of the VGG16 (as shown in the figure bellow), the output of the network is a vector of 4096 components. Those components are the one that is used by VGG16 to make the difference between cats, dogs, cars and many other types of images. Thus, they contain lots of information about the content of the image in a relatively compact way. We like to see the output of the vector as the DNA of the picture that is analyzed by the last layer to name the content of the image. So, somehow after this layer, the object is already "uniquely" identified... it's just that we don't know how to call it.

In the following, we would refer to this modified VGG16 algorithm as the "topless" VGG16.

They are many tasks that actually do not require to name the content of the image. Think for instance to a task where you have an image and you'd like to identify all other images that look like this one. The topless VGG16 algorithm would be very good at this, as similar images would have similar DNA content. So the complex problem of finding similar images in a collection is transformed to the trivial problem of finding similar vectors. This is achieved very easily by linear algebra and more precisely by the dot product of the two vectors. If you are not familiar with this concept, you can search for cosine similarity on the web.

Most of the Reverse Image Search services, like TinEye or Google image search, rely on such technique for comparing images.

In the following of this blog, we will show how the technique perform in a simple example.

Demonstration of the reverse image search on a deck of card

We will show the performance of the topless VGG16 algorithms for identifying similar cards out of two shuffled decks of cards. In order to make the exercise a bit more tricky, we used two decks of cards with very different image styles and very different image resolutions and dimensions. The first one as 166x116 dimensions, while the second one has 91x67 dimensions. Both will be resized to 224x224 dimensions has this is the expected input size of the VGG16 algorithm. The figure bellow shows the 104 cards that we are going to use.

Every card is passed through the topless VGG16 in order to get its associated DNA vector. The vectors are then normalized to norm=1 and used to compute the similarity between the ith card and the jth card by computing the cosine similarity (dot product) of the two vectors. Finally, we can build the similarity matrix which would contain the similarity between all pair of cards. The matrix is analyzed to find out what are the 5 most similar cards to a given one.

Let's look at the most similar cards (as predicted by the topless VGG16 also) for a few examples. In the pictures here after, the left most card is always the reference card for which we are trying to identify the top-5 most similar cards. The value of the cosine similarity between the two picture DNA vectors is also given at the top of most similar images.

Example 1: The Ace of Diamonds
In both decks of cards, the most similar card to the ace of diamonds is identified as the ace of hearts of the same deck. Interestingly, for the second most similar card, we have different results depending on the deck. For deck1, the ace of diamonds of the other deck is tagged as the second to most similar. For deck2, on the contrary, it spots the 2 of diamonds as being more similar than the ace of the other deck. In both decks, the 2 and 3 of diamonds are marked as rather similar to the ace, which makes sense as in all cases, there are diamond symbols in the top left and bottom right of the card, and the overall card symmetry is similar.

Example 2: The Five of Diamonds
In both decks of cards, the most similar card to the five of diamonds is identified as the six of diamonds in the same deck. Interestingly, the following cards are similar in both decks and have all a rather close similarity value, which explain why the order is not necessarily identical. All these cards actually have a value that is close to 5: 3,4,6,7. This makes perfect sense.

Example 3: The Jack of Diamonds
Here, the image on the cards become much more complex and the behavior of the algorithm is, therefore, a bit less clear. What we noticed, is that the direction of the character face, the color of the character hair, the style of the weapons and the clothes colors are all playing an important role in the card similarity. These features matter much more than the card color. Wich also makes perfect sense as these features occupy quite a large part of the picture (much more than the card color symbols). We can also notice that the similarity between cards is also lower than in the previous case.

Example 4: The Queen of Diamonds
Similarly, to what was discussed in example 3, as the images are quite rich in terms of features, it is more difficult to understand the deep reasons behind the similarity scores. But clearly, the clothes styles seems to have quite some importance here.

Example 5: The King of Diamonds
Finally, we can have a look at the results for the king of diamonds. Interestingly, in the deck1, it identifies left looking "male" cards, then the right looking "male" and finally the queen of diamonds. In the second deck, the resolution of the picture is much worse, but we despite that we can see a similar kind of behavior.

Conclusion

In conclusion, we have seen that convolutional neural networks are quite large neural networks with hundreds of millions of learnable parameters that are generally hard to train by individuals with limited computing resources. However, we have also explained that transfer learning allows individuals to recycle pre-trained models for other purposes with minimal (re)training or even without retraining at all for the case of reverse image search. We have demonstrated that the convolutional neural networks are capable of extracting the features associated with an image into the form of a picture DNA vector. Comparing these DNA vectors allows identifying pictures that have the similar type of features and that are therefore similar to each other.

Have you already faced similar type of issues ? Feel free to contact us, we'd love talking to you…

If you enjoyed reading this post, please like it. It doesn't cost you anything, but matters for me!

Sparkling Water on the Spark-Notebook

loic.quertenmont@gmail.com (Loic Quertenmont) — Tue, 11 Apr 2017 06:30:47 +0000

Note: This blog post was written as a collaboration between Kensu.io and H2O.ai and the blog content was initially posted on on blog.H2O.ia. You can either read it here, or continue your reading on its original publication page.

In the space of Data Science development in enterprises, two outstanding scalable technologies are Spark and H2O. Spark is a generic distributed computing framework and H2O is a very performant scalable platform for AI. Their complementarity is best exploited with the use of Sparkling Water. Sparkling Water is the solution to get the best of Spark – its elegant APIs, RDDs, multi-tenant Context and H2O’s speed, columnar-compression and fully-featured Machine Learning and Deep-Learning algorithms in an enterprise ready fashion. Examples of Sparkling Water pipelines are readily available in the H2O github repository, we have revisited these examples using the Spark-Notebook.

The Spark-Notebook is an open source notebook (web-based environment for code edition, execution, and data visualization) focused on Scala and Spark. This is a notebook comparable to Jupyter. The Spark-Notebook is part of the Adalog suite of Kensu.io which addresses agility, maintainability and productivity for data science teams. Adalog offers to data scientists a short work cycle to deploy their work to the business reality and to managers a set of data governance giving a consistent view on the impact of data activities on the market.

This new material allows diving into Sparkling Water in an interactive and dynamic way.

Working with Sparking Water in the Spark-Notebook scaffolds an ideal platform for big data /data science agile development. Most notably, this gives the data scientist the power to:

Write rich documentation of his work alongside the code, thus improving the capacity to index knowledge
Experiment quickly through interactive execution of individual code cells and share the results of these experiments with his colleagues.
Visualize the data he/she is feeding H2O through an extensive list of widgets and automatic makeup of computation results.

Most of the H2O/Sparkling water examples have been ported to the Spark-Notebook and are available in a github repository.

We are focussing here on the Chicago crime dataset example and looking at:

How to take advantage of both H2O and Spark-Notebook technologies,
How to install the Spark-Notebook,
How to use it to deploy H2O jobs on a spark cluster,
How to read, transform and join data with Spark,
How to render data on a geospatial map,
How to apply deep learning or Gradient Boosted Machine (GBM) models using Sparkling Water

Installing the Spark-Notebook:

Installation is very straightforward on a local machine. Follow the steps described in the Spark-Notebook documentation and in a few minutes, you will have it working. Please note that Sparkling Water works only with Scala 2.11 and Spark 2.02 and above currently.
For larger projects, you may also be interested to read the documentation on how to connect the notebook to an on-premise or cloud computing cluster.

The Sparkling Water notebooks repo should be cloned in the “notebooks” directory of your Spark-Notebook installation.

Integrating H2O with the Spark-Notebook:

In order to integrate Sparkling Water with the Spark-Notebook, we need to tell the notebook to load the Sparkling Water package and specify custom spark configuration, if required. Spark then automatically distributes the H2O libraries on each of your Spark executors. Declaring Sparkling Water dependencies induces some libraries to come along by transitivity, therefore take care to ensure duplication or multiple versions of some dependencies is avoided.
The notebook metadata defines custom dependencies (ai.h2o) and dependencies to not include (because they’re already available, i.e. spark, scala and jetty). The custom local repos allow us to define where dependencies are stored locally and thus avoid downloading these each time a notebook is started.

"customLocalRepo": "/tmp/spark-notebook",
"customDeps": [
  "ai.h2o % sparkling-water-core_2.11 % 2.0.2",
  "ai.h2o % sparkling-water-examples_2.11 % 2.0.2",
  "- org.apache.hadoop % hadoop-client %   _",
  "- org.apache.spark  % spark-core_2.11    %   _",
  "- org.apache.spark % spark-mllib_2.11 % _",
  "- org.apache.spark % spark-repl_2.11 % _",
  "- org.scala-lang    %     _         %   _",
  "- org.scoverage     %     _         %   _",
  "- org.eclipse.jetty.aggregate % jetty-servlet % _"
],
"customSparkConf": {
  "spark.ext.h2o.repl.enabled": "false"
},

With these dependencies set, we can start using Sparkling Water and initiate an H2O context from within the notebook.

Benchmark example – Chicago Crime Scenes:

As an example, we can revisit the Chicago Crime Sparkling Water demo. The Spark-Notebook we used for this benchmark can be seen in a read-only mode here.

Step 1: The Three datasets are loaded as spark data frames:

Chicago weather data : Min, Max and Mean temperature per day
Chicago Census data : Average poverty, unemployment, education level and gross income per Chicago Community Area
Chicago historical crime data : Crime description, date, location, community area, etc. Also contains a flag telling whether the criminal has been arrested or not.

The three tables are joined using Spark into a big table with location and date as keys. A view of the first entries of the table are generated by the notebook’s automatic rendering of tables (See a sample on the table below).

Geospatial charts widgets are also available in the Spark-Notebook, for example, the 100 first crimes in the table:

Step 2: We can transform the spark data frame into an H2O Frame and randomly split the H2O Frame into training and validation frames containing 80% and 20% of the rows, respectively. This is a memory to memory transformation, effectively copying and formatting data in the spark data frame into an equivalent representation in the H2O nodes (spawned by Sparkling Water into the spark executors).
We can verify that the frames are loaded into H2O by looking at the H2O Flow UI (available on port 54321 of your spark-notebook installation). We can access it by calling “openFlow” in a notebook cell.

Step 3: From the Spark-Notebook, we train two H2O machine learning models on the training H2O frame. For comparison, we are constructing a Deep Learning MLP model and a Gradient Boosting Machine (GBM) model. Both models are using all the data frame columns as features: time, weather, location, and neighborhood census data. Models are living in the H2O context and thus visible in the H2O flow UI. Sparkling Water functions allow us to access these from the SparkContext.

We compare the classification performance of the two models by looking at the area under the curve (AUC) on the validation dataset. The AUC measures the discrimination power of the model, that is the ability of the model to correctly classify crimes that lead to an arrest or not. The higher, the better.

The Deep Learning model leads to a 0.89 AUC while the GBM gets to 0.90 AUC. The two models are therefore quite comparable in terms of discrimination power.

Step 4: Finally, the trained model is used to measure the probability of arrest for two specific crimes:

A “narcotics” related crime on 02/08/2015 11:43:58 PM in a street of community area “46” in district 4 with FBI code 18.
The probability of being arrested predicted by the deep learning model is 99.9% and by the GBM is 75.2%.
A “deceptive practice” related crime on 02/08/2015 11:00:39 PM in a residence of community area “14” in district 9 with FBI code 11.
The probability of being arrested predicted by the deep learning model is 1.4% and by the GBM is 12%.

The Spark-Notebook allows for a quick computation and visualization of the results:

Summary

Combining Spark and H2O within the Spark-Notebook is a very nice set-up for scalable data science. More examples are available in the online viewer. If you are interested in running them, install the Spark-Notebook and look in this repository. From that point , you are on track for enterprise-ready interactive scalable data science.

Have you already faced similar type of issues ? Feel free to contact us, we'd love talking to you…

R vs Python vs Scala vs Spark vs TensorFlow... The quantitative answer!

loic.quertenmont@gmail.com (Loic Quertenmont) — Mon, 06 Mar 2017 07:05:23 +0000

In this blog, we will finally give an answer to THE question: R, Python, Scala, Spark, Tensorflow, etc... What is the best one to answer data science questions? The question itself is totally absurd, but they are so many people asking it on social network that we find it worth to finally answer the recurrent question using a scientific methodology. At the end of this blog, you will find a quantitative answer comparing the computing time of each language/library for fitting the exact same Generalized Linear Model (GLM). Many features matter in the choice of a language/library, among them , the computing and developing time are for sure very important criteria.

The methodology that we are adopting to answer this famous question is, therefore, considering both the performance of the tool in terms of computing time but also the easiness of the tool in terms of data exploration, model definition etc. The quality of the tool documentation is also an important factor that contributes to faster development time cycles. Finally, scalability of the tool with respect to the size of the dataset is also an important factor in the big data era.

Model and Dataset

The dataset that we are using in these comparisons is the airline dataset which contains information about flight details since 1987. The dataset is publicly available on the website of the American Statistical Association. The dataset is made of 29 columns, 7009728 rows and weights 658MB on disk. We will use a 1M flight details from year 2008 as our benchmark dataset. In this post, we will use each tool to put together a model to predict if a flight will arrive on time. The model prediction is based on 9 columns of the input dataset:

Departure date of the flight (Year, Month, DayOfMonth and, DayOfWeek),
Departure time
Air time
Distance
Origin airport
Destination airport

We will use a General Linear Model (GLM) to make the prediction. The GLM is also known as logistics regression, or logit regression. This is quite a basic model which is extensively used in everyday business and which is available in all tools. It therefore offers a nice point of comparison for the types of models that matters today.

It's time to remind that the objective of this blog is not to define the most accurate model for predicting if a flight will arrive on time. The main objective of this blog is to offer a quantitative comparison of data science tools in terms of computing performances and coding easiness. In other words, it is on purpose that we use a simple model with a limited number of inputs.

Comparison Procedure

All the comparisons are made in jupyter notebooks using the exact same analysis flow. The notebooks are available on GitHub and could be visualized on NBviewer. Note: If you would like to add comparison to other tools or dataset, feel free to push your results to the GitHub directory. The analysis flow that we repeat for all tools is the following:

Import the library
Load and explore the dataset
Prepare the dataset for training
Define and fit the model
Test the model and Measure the accuracy

Tools (Library/Languages)

R is the installed language in the academic and data science community. SAS is its main commercial competitor. R is considered to be good for pure statistical analysis and it is open source. However, languages like Python and Scala are eating out market share of R. A recent survey by Burtch Works shows how Python gain steam in the analytics community, at the expense of R and proprietary packages like SAS, IBM‘s SPSS, and Mathworks‘ Matlab. [caption id="attachment_1561" align="aligncenter" width="300"] Source Burtch Works[/caption] At the time of writing this blog, we have analyzed the following languages/libraries:

R
Python3 + Scikit-learn
Python3 + Tensorflow
Python3 + Keras
Scala + Spark

Data manipulations in Python and Scala are done using Pandas and Spark data frame libraries, respectively. Both are inspired by R data frame tool and are therefore very similar. Spark library could also be used in Python, but we prefer to test it using Scala language as this is the native language of the Spark library. We should also highlight that although Spark can be used to analyze small dataset like this one, it was initially designed for the analysis of datasets that are so big that they can not be analyzed efficiently on one single computer.

Data Loading and Exploration

In this section, we compare the code complexity for the following tasks:

Load a dataset from a CSV file
Print the total number of rows in the dataset
Trim the dataset to the first million rows
List the column in the dataset
Display the first 5 rows on in the jupyter output cell.

From the code snippet bellow, we can see that R and python have almost the same syntax. Python (and its libraries Pandas and Numpy are a bit more object oriented which make their usage easier). Spark equivalent code is a bit more complicated but offers the advantage to handle distributed data loading from Hadoop File System (HDFS).

In R:

df = read.csv("2008.csv") #Load data
nrow(df)                  #Get number of rows
df = df[0:1000000,]       #Keep the first 1M rows
names(df)                 #List columns
df[0:5, ]                 #Get the 5 rows

In Python3:

df = pd.read_csv("2008.csv") #Load data
df.shape[0]                  #Get number of rows
df = df[0:1000000]           #Keep the first 1M rows
df.columns                   #List columns
df[0:5]                      #Get the 5 rows

In Scala:

val dffull = sqlContext.read.format("csv")
                       .option("header", true)
                       .option("inferSchema", true)
                       .load("2008.csv")              //Load data
dffull.count                                          //Get number of rows
val df = dffull.sample(false, 1000000.toFloat/count)  //Keep the first 1M rows
df.printSchema()                                      //List columns
df2.show(5)                                           //Get the 5 rows

Data Preparation for Model training

In this section, we compare the code complexity for selecting the columns of interest for our model, encode categorical variables and split the dataset into a training sample and a testing sample.

From the code snippet bellow, we can see that R and python have again almost the same syntax. Spark data processing is a bit different. It relies on the Spark ML Pipelines mechanism which allows better optimization of distributed calculus.

In R:

#drop rows where delay column is na
df = df[is.na(df$ArrDelay)==0,]
#turn label to numeric
df["IsArrDelayed" ] <- as.numeric(df["ArrDelay"]>0)
#mark as categorical
df["Origin"       ] <- model.matrix(~Origin       , data=df)
df["Dest"         ] <- model.matrix(~Dest         , data=df)
#split the dataset in two parts
trainIndex = sample(1:nrow(df), size = round(0.8*nrow(df)), replace=FALSE)
train = df[ trainIndex, ]
test  = df[-trainIndex, ]

In Python3:

#drop rows where delay column is na
df = df.dropna(subset=["ArrDelay"])
#turn label to numeric
df["IsArrDelayed" ] = (df["ArrDelay"]>0).astype(int)
#Mark as categorical (replace by one hot encoded version)
df = pd.concat([df, pd.get_dummies(df["Origin"], prefix="Origin")], axis=1);
df = pd.concat([df, pd.get_dummies(df["De, axis=1);
#split the dataset in two parts
train = df.sample(frac=0.8)
test  = df.drop(train.index)

In Scala:

//build a pipeline to turn categorical variables to encoded version
//and to build a feature vector concatenating all training column into a vector
val OriginIndexer = new StringIndexer()
.setInputCol("Origin")
.setOutputCol("OriginIndex")
val OriginEncoder = new OneHotEncoder()
.setInputCol("OriginIndex")
.setOutputCol("OriginVec")
val DestIndexer = new StringIndexer()
.setInputCol("Dest")
.setOutputCol("DestIndex")
val DestEncoder = new OneHotEncoder()
.setInputCol("DestIndex")
.setOutputCol("DestVec")
val Assembler = new VectorAssembler()
.setInputCols(Array("Year","Month",  "DayofMonth" ,"DayOfWeek", "DepTime", "AirTime", "Distance", "OriginVec", "DestVec"))
.setOutputCol("Features")
val pipeline = new Pipeline()
.setStages(Array(OriginIndexer, OriginEncoder, DestIndexer, DestEncoder, Assembler))
//Transform the dataset using the above pipeline
val Preparator = pipeline.fit(df2)
val dfPrepared = Preparator.transform(df2).cache()
//Split the dataset in two parts
val Array(train, test) = dfPrepared.randomSplit(Array(0.8,0.2))

Model building and training

In this section, we jump into the heart of the statistical part of the code. Although, The length of the code to define and use a GLM model is quite small for most of the tools, the time needed to execute these few lines of code can be quite long depending on the library (and it's underlying optimization), on the model complexity and on the size of the dataset itself. Fixing the free parameters of a GLM model requires iterating over the dataset until all free parameters are converging to a unique value. This procedure is often referred to as the fitting of the model.

They are generally several library options available for fitting a model in a given language. This is particularly true in Python for which new data science and deep learning libraries are developed every day. Among those, scikit-learn is the reference for many years for all data science algorithms. Deep learning libraries like Google Tensorflow and Keras are also gaining in popularity and offers the possibility to exploit the Graphical Processing Unit (GPU) for faster model fitting. Keras uses either Tensorflow (or Theano) as a back-end for the model fitting but it makes the programming a bit easier for common statistical model and algorithms.

Note: For Tensorflow, we need to decompose the GLM model into simple matrix operations, so the code is a bit more lengthy. For those who need a reminder, the GLM model has linear logits which are linear with respect to the model features X. Logits = (X*W)+B = (Features * Coefficients) + Bias. The model predictions are defined as the sigmoid of the logits. The model loss function (used to optimize the model parameters W and B) is a logistic function.

In the following, you should pay attention at the code complexity, but also at the time it took for fitting the model. The model predicting power will be discussed in the section.

All test were run on a dataset of approximately 80% x 1M rows and on the same computer powered by an Intel I7-6700K @4.0GHz (8cores) and a GTX970 with 4GB GPU. The GPU is only used in Tensorflow/Keras benchmarks. Spark was used in a local model, so it uses the 8 cores of the processor, but nothing more.

In R: The model fitting took ~19min

#define the model and fit it
model <- glm(IsArrDelayed ~ Year + Month + DayofMonth + DayOfWeek + DepTime + AirTime + Origin + Dest + Distance,data=train,family = binomial)

In Python3 with Scikit-learn: The model fitting took ~13sec

#define the model feature columns and the label column
OriginFeatCols = [col for col in df.columns if ("Origin_" in col)]
DestFeatCols   = [col for col in df.columns if ("Dest_"   in col)]
features = train[["Year","Month",  "DayofMonth" ,"DayOfWeek", "DepTime", "AirTime", "Distance"] + OriginFeatCols + DestFeatCols  ]
labels   = train["IsArrDelayed"]
#define the model per itself (C is the inverse of L2 regularization strength
model = LogisticRegression(C=1E5, max_iter=10000)
#fit the model
model.fit(features, labels)

In Python3 with Tensorflow: The model fitting took ~11sec

featureSize = features.shape[1]
labelSize   = 1
training_epochs = 25
batch_size = 2500
 
#Define the model computation graph
graph = tf.Graph()
with graph.as_default():
# tf Graph Input
LR = tf.placeholder(tf.float32 , name = 'LearningRate')
X = tf.placeholder(tf.float32, [None, featureSize], name="features") # features
Y = tf.placeholder(tf.float32, [None, labelSize], name="labels")   # training label
 
# Set model weights
W = tf.Variable(tf.random_normal([featureSize, labelSize],stddev=0.001), name="coefficients")
B = tf.Variable(tf.random_normal([labelSize], stddev=0.001), name="bias")
 
# Construct model
logits = tf.matmul(X, W) + B
with tf.name_scope("prediction") as scope:
P      = tf.nn.sigmoid(logits)
 
# Cost function and optimizer (Minimize error using cross entropy)
L2  = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(targets=Y, logits=logits) ) + 1E-5*L2
optimizer = tf.train.AdamOptimizer(LR).minimize(cost)
 
# Initializing the variables
init = tf.initialize_all_variables()
 
#Fit the model (using a training cycle with early stopping)
avg_cost_prev = -1
for epoch in range(training_epochs):
avg_cost = 0.
total_batch = int(features.shape[0]/batch_size)
# Loop over all batches
for i in range(total_batch):
batch_xs = featuresMatrix[i*batch_size:(i+1)*batch_size]#features[i*batch_size:(i+1)*batch_size].as_matrix()
batch_ys = labelsMatrix[i*batch_size:(i+1)*batch_size]#labels  [i*batch_size:(i+1)*batch_size].as_matrix().reshape(-1,1)
 
#set learning rate
learning_rate = 0.1 * pow(0.2, (epoch + float(i)/total_batch))
 
# Fit training using batch data
_, c = sess.run([optimizer, cost], feed_dict={X: batch_xs, Y: batch_ys, LR:learning_rate})
 
# Compute average loss
avg_cost += c / total_batch
 
#check for early stopping
if(avg_cost_prev>=0 and (abs(avg_cost-avg_cost_prev))<1e-4):
break
else: avg_cost_prev = avg_cost

In Python3 with Keras: The model fitting took ~55sec

featureSize     = features.shape[1]
labelSize       = 1
training_epochs = 25
batch_size      = 2500

from keras.models import Sequential 
from keras.layers import Dense, Activation 
from keras.regularizers import l2, activity_l2
from sklearn.metrics import roc_auc_score
from keras.callbacks import Callback
from keras.callbacks import EarlyStopping

#DEFINE A CUSTOM CALLBACK
class IntervalEvaluation(Callback):
    def __init__(self): super(Callback, self).__init__()
    def on_epoch_end(self, epoch, logs={}): print("interval evaluation - epoch: %03d - loss:%8.6f" % (epoch, logs['loss']))

#DEFINE AN EARLY STOPPING FOR THE MODEL
earlyStopping = EarlyStopping(monitor='loss', patience=1, verbose=0, mode='auto')
        
#DEFINE THE MODEL
model = Sequential() 
model.add(Dense(labelSize, input_dim=featureSize, activation='sigmoid', W_regularizer=l2(1e-5))) 
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy']) 

#FIT THE MODEL
model.fit(featuresMatrix, labelsMatrix, batch_size=batch_size, nb_epoch=training_epochs,verbose=0,callbacks=[IntervalEvaluation(),earlyStopping]);

In Scala: The model fitting took 44sec

//Define the model
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.001)
.setLabelCol("IsArrDelayed")
.setFeaturesCol("Features")
//Fit the model
val lrModel = lr.fit(train)

As for the other parts, R and python with the (almost native) scikit-learn library are very similar in terms of code complexity. What is quite astonishing is the huge difference in computing time in favor of python. A 15 sec python training translate to a 19 min equivalent training in R. R is ~90 times slower than python with scikit-learn and 25 times slower than Spark.

The fact that spark is a bit slower than python is expected here because Spark communication among each computing nodes is slowing down the process a bit. Here the computing nodes correspond to the 8 cores of the same processor, but still, the communication unit of spark is still eating up a bit of the resource. Spark would start to show an advantage for much larger datasets (not fitting entirely in the computer RAM) or when used on a computing cluster made of dozens of computing nodes or more.

Similarly, Keras seems slower than Tensorflow alone. We imagine that the cause is also due to a sort of overhead in the communication between Keras front-end and (Tensorflow) backend. This overhead is probably negligible for complex deep learning models with a dozen of layers or even more which are very different from the extremely simple GLM model that we are using for this test. Moreover, we limited out GLM Tensorflow model implantation to its most minimal form. It is, therefore, hard for Keras to do better.

Tensorflow computing time is almost identical to what we obtained with scikit-learn despite the usage of a GPU with more than 1664 CUDA computing cores. This is again due to the simplicity of the model. The time spent in the model optimization during the training iteration is actually going very fast. The vast majority of the time is actually spent in transferring the training data to the graphic card and gathering back the results from the GPU. More complex models would, therefore, be trained within basically the same amount of time given that the bottleneck here is not the computing power. They are better Tensorflow implantation that could solve this issue and reduce the training time. One of them is to use input Queues. But we have not explored this solution for this blog.

Model testing and Accuracy

In this section, we show compare code snippets to get the model prediction on the testing dataset, to draw the model ROC curve, and to get the model Area Under the Curve (AUC) which is a good indicator of the model classification performance. The higher the AUC the better. But since all the models are using the same input features and the same dataset, we expect the AUC of all the model implantations to be identical. There might be some small difference due to the randomness of the optimization process itself or due to slightly different stopping conditions.

In R: AUC=0.706

#Get the predictions
test["IsArrDelayedPred"] <- predict(model, newdata=test, type="response")
#Compare prediction with truth and draw the ROC curve
pred <- prediction(test$IsArrDelayedPred, test$IsArrDelayed)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10))
#Get the AUC
AUC = performance(pred, measure = "auc")@y.values</pre>
<h3>In Python3:
with Scikit-learn: AUC=0.702
with Tensorflow: AUC=0.699
with Keras: AUC=0.689</h3>
<pre>
#Get the predictions
testFeature = test[["Year","Month",  "DayofMonth" ,"DayOfWeek", "DepTime", "AirTime", "Distance"] + OriginFeatCols + DestFeatCols  ]
test["IsArrDelayedPred"] = model.predict( testFeature.as_matrix() )
#Compare prediction with truth and draw the ROC curve
fpr, tpr, _ = roc_curve(test["IsArrDelayed"], test["IsArrDelayedPred"])
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=4, label='ROC curve')
#Get the AUC
AUC = auc(fpr, tpr)

In Scala with Spark: AUC=0.645

//Get the predictions
val testWithPred = lrModel.transform(test.select("IsArrDelayed","Features"))
//Compare prediction with truth and draw the ROC curve
val trainingSummary = lrModel.evaluate(test)
val binarySummary = trainingSummary.asInstanceOf[BinaryLogisticRegressionSummary]
val roc = binarySummary.roc
plotly.JupyterScala.init()
val fpr = roc.select("FPR").rdd.map(_.getDouble(0)).collect.toSeq;
val tpr = roc.select("TPR").rdd.map(_.getDouble(0)).collect.toSeq;
plotly.Scatter(fpr, tpr, name = "ROC").plot(title = "ROC Curve")
//Get the AUC
println(s"areaUnderROC: ${binarySummary.areaUnderROC}")

All implantations are getting quite similar AUC values. Scala with Spark is a bit behind, but the model parameters have not been tuned at all. We could certainly improve this results by tuning the model convergence, regularization and early stopping parameters of the model training. As the score still remain relatively close to the other implantation, we consider this result as satisfactory for this comparison blog.

Summary

In summary, we have shown that although code complexity is very similar between R and python implantations of the GLM model, the computing time necessary for training the model is significantly higher in the case of the R implantation. The accuracy of the model itself is about the same. It is clear that R paved the way to modern statistical and data science toolkits, but unfortunately he seems to be left behind compared to more modern framework like Python or Spark. We would therefore recommend to move your data science framework to the Python language which offers much better performances than R, with a coding style very similar to what you are used to do with R data frames. You also get the extra advantage to benefits from the developments of the deep-learning libraries that are produced daily in python. Those brings marginal improvement in the case of simple models like GLM, but seriously boost your performance in the case of more complex models (i.e. several computation layers).

	Code Complexity	Computing Time (sec)	AUC	Documentation Quality	Additional Remarks
R	*	1140	0.71	GOOD	Much slower
Python3 Scikit-learn	*	13	0.70	EXCELLENT	The winner for GLM
Python3 Tensorflow	**	11	0.70	GOOD	Exploit GPU
Python3 Keras	*	55	0.69	GOOD	Exploit GPU; Good for complex models
Scala Spark	**	44	0.65	GOOD	Good for large dataset

Note: We'd like to extend this comparison table (so as my GitHub directory) with more frameworks. We would, in particular, be interested in comparisons with commercial software (SAS, SPSS, etc.). Contact us if you want to help.

Have you already faced similar type of issues ? Feel free to contact us, we'd love talking to you…

If you enjoyed reading this post, please like it. It doesn't cost you anything, but matters for me!

Scalable Geospatial data analysis with Geotrellis, Spark, Sparkling-Water and, the Spark-Notebook

loic.quertenmont@gmail.com (Loic Quertenmont) — Fri, 03 Mar 2017 07:05:22 +0000

Note: This blog post was initially written for the blog of Kensu.io, You can either read it here, or continue your reading on its original publication page.

This blog shows how to perform scalable geospatial data analysis using Geotrellis, Apache Spark, Sparkling-Water and the Spark-Notebook.

As a benchmark for this blog, we use the 500 images (and 45GB) dataset distributed by Kaggle/DSTL.

After reading this blog post, you will know how to:

Load GeoJSON and GeoTIFF files with Geotrellis,
Manipulate/resize/convert geospatial rasters using Geotrellis,
Distribute geospatial pictures analysis on a spark cluster,
Display geospatial tiles in the Spark-Notebook,
Create multispectral histogram from a distributed image dataset,
Cluster image pixels based on multi-spectral intensity information,
Use H2O Sparkling-Water to train a machine learning algorithm on a distributed geospatial dataset,
Use a trained model to identify objects on large geospatial images,
How to vectorize object rasters into polygons and save them to distributed (parquet) file systems

A Little Background

GeoTrellis is a geographic data processing engine for high-performance GIS applications. It comes with a number of functions to load/save rasters on various file systems (local, S3, HDFS and more), to rasterize polygons, to vectorize raster images, and, to manipulate raster data, including cropping/warping, Map Algebra operations, and rendering operations.

The Spark-Notebook is an open source notebook (web-based environment for code edition, execution, and data visualization), focused on Scala and Spark. It is thus well suited for enterprise environments, providing Data Scientists and Data Engineers with a common interactive environment for development and scalable machine learning. The Spark-Notebook is part of the Adalog suite of Kensu.io which addresses agility, maintainability, and productivity for data science teams. Adalog offers to data scientists a short work cycle to deploy their work to the business reality and offers to managers a set of data governance giving a consistent view on the impact of data activities on the market.

Sparkling-Water is the solution to get the best of Spark – its elegant APIs, RDDs, multi-tenant Context and H2O’s speed, columnar-compression and fully-featured Machine Learning and Deep-Learning algorithms in an enterprise-ready fashion

The environment

Installing the Spark-Notebook:

Just follow the steps described in the Spark-Notebook documentation and in less than 5 minutes you’ll have it working locally. For a larger project, you may also be interested in reading the documentation on how to connect the notebook to an on-premise or cloud computing cluster.

Integrating Geotrellis and Sparkling-Water:

In order to integrate Geotrellis and Sparkling-Water with the Spark-Notebook, we need to tell the notebook to load the library dependencies. After this, Spark will automatically distribute the libraries to the spark executors on the cluster. Possible conflicts caused by different version of spark shipped by the Notebook and Sparkling-Water are handled by editing the notebook meta-data like this:

"customRepos": [
"osgeo % default % http://download.osgeo.org/webdav/geotools/ % maven"
],
"customDeps": [
"org.locationtech.geotrellis % geotrellis-spark_2.11 % 1.0.0",
"org.locationtech.geotrellis % geotrellis-geotools_2.11 % 1.0.0",
"org.locationtech.geotrellis % geotrellis-shapefile_2.11 % 1.0.0",
"org.locationtech.geotrellis % geotrellis-raster_2.11 % 1.0.0",
"ai.h2o % sparkling-water-core_2.11 % 2.0.3",
"- org.apache.hadoop % hadoop-client % _",
"- org.apache.spark % spark-core_2.11 % _",
"- org.apache.spark % spark-mllib_2.11 % _",
"- org.apache.spark % spark-repl_2.11 % _",
"- org.scala-lang % _ % _",
"- org.scoverage % _ % _",
"- org.eclipse.jetty.aggregate % jetty-servlet % _"
],
"customSparkConf": {
"spark.ext.h2o.repl.enabled": "false",
"spark.ext.h2o.port.base": 54321,
"spark.app.name": "Notebook",
},

After this, we are done with setting up the environment and we can start using the notebook to answer business/data science questions.

Benchmark example

The notebooks we used to explore this dataset are visible here (read-only mode). The first one is used to explore the training dataset and perform machine-learning training. The second notebook is used to predict the object class types on the entire dataset. In this blog, we are only focusing on some specific parts of these notebooks. The files can also be downloaded from GitHub.

1) Description of the DSTL/Kaggle Dataset

The goal of the competition is to detect and classify the types of objects found in the image dataset. The full description of the competition and its dataset are available on the Kaggle website. Below is a short summary of the part of interest for this blog:

DSTL provides 1km x 1km satellite images in both 3-band and 16-band GeoTIFF formats. The images are coming from the WorldView 3 satellite sensor. In total, there are 450 images of which 25 have training labels.

The DSTL/Kaggle data that we are using consist of:

three_band: The 3-band images are the traditional RGB natural color images.It is labeled as “R” and has an intensity resolution of 11-bits/pixel and a spatial resolution of 0.31m.
sixteen-band: The 1+16-band images contain spectral information by capturing wider wavelength channels.
- The 1 Panchromatic band (450-800 nm) has an intensity resolution of 11-bits/pixel and a spatial resolution of 0.31m. It is labeled “P”.
- The 8 Multispectral bands from 400 nm to 1040 nm (red, red edge, coastal, blue, green, yellow, near-IR1 and near-IR2) has an intensity resolution of 11-bits/pixel and a spatial resolution of 1.24m. It is labeled “M”.
- The 8 short-wave infrared (SWIR) bands (1195 – 2365 nm) has an intensity resolution of 14-bits/pixel and a spatial resolution of 7.5m. It is labeled “A”.
grid_sizes.csv: the sizes of grids for all the images
train_geojson: GeoJSON files containing identified (multi-)polygons on the 25 training images. There are polygons of each of the 10 possible object class types used in this competition:
- Buildings
- Misc. Manmade structures
- Road
- Track
- Trees
- Crops
- Waterway
- Standing water
- Large Vehicle
- Small Vehicle.

2) GeoTIFF loading and image exploration

For this benchmark, the easiest is to build a Spark RDD in which the elements contain all the spectral information for a given image — the R, P, M and A (3+1+8+8) bands.

Since the intensity resolution differs in each band, we convert the images to a float format where the intensity of each pixel is in the range from 0 to 1.

Similarly, we resize all the images to the best space resolution (R and P dimensions) — approximately 3400 x 3400 pixels. A Bicubic-Spline interpolation is used during this process. Geotrellis functions are very efficient to load, convert and resize pictures.

Note: actually, R and P dimensions can slightly differ from one picture to another, so we can’t resize them all to the same unique dimension.

Finally, we align pictures from different bands such that the objects on the images are at the exact same pixel coordinates on all spectral bands. To do so, the resized images from band A, M and P are shifted by a horizontal and vertical offset with respect to the R band. The alignment constants are computing in an external script using the findTransformECC function of openCV which is particularly efficient at finding these offsets.

val processedTiffM = MultibandGeoTiff(path+"_M.tif")
 .tile.convert(DoubleConstantNoDataCellType).mapDouble{ (b,p) = p/2048.0}
 .resample(new Extent(0,0,nCols, nRows), nColsNew, nRowsNew, CubicSpline).

We do this for all the images of the training set and show the resulting images (from the R band tile). These miniatures allow us to pick-up an interesting benchmark example that we can use for detailed studies. In the following, we will use the image labeled 6120_2_2. It shows a village in a dusty desert.

For the selected image, we can show the intensity of each spectral band. We can immediately see that each spectral band contains complementary information of the same geographical location.

From this step, it’s already obvious that we can exploit the difference of the band intensity to categorize objects on the ground.

3) Object Polygons

For the images in the training set, DSTL/Kaggle also provides GeoJSON files indicating the location of identified objects on the ground. The JSON files contain coordinates of polygon vertices associated with the objects of a given class type.

The files are easily loaded thanks to Geotrellis (GeoJson.fromFile) functions. The library also offers functions to make a raster image out of the vectorial polygon information. We use the function “PolygonRasterizer.foreachCellByPolygon” to create masks of the various object class types visible on the images.

The image grid_sizes are used in this process in order to project the vectorial polygons on the raster images.

In the figure below, identified objects are shown as black pixels.

We can use those masks to select pixels associated to specific object class types from the 20 available spectral bands. These pixels will be used to build a machine learning algorithm trained to identify specific objects.

Before that, we will zoom in the top-left corner of the picture and overlay (in blue) the object polygons to the RGB picture in order to observe the level of details of the polygons (and of the picture). We can also observe that some pixels belong to more than one object class. Note for instance that the “trees” in the left part of the pictures are also part of a “crop”. The prediction algorithm that we are designing will, therefore, need to achieve multi-class tagging with the same level of details than what we see here.

4) Spectral Histograms per object types

At this stage, it is interesting to see how the raw image histogram (per spectral band) by the object type masks. This tells us which bands are useful to discriminate some object types.

For instance, We see that the near-IR2 band is particularly good at discriminating water, building, and crops from the rest. Other bands might be more performant for other objects.

The figure below shows the histogram for the 8 bands of the ‘M’ GeoTIFF

5) Model Learning

We will H2O Sparkling-Water to train Gradient Boosted Machine (GBM) models that discriminate pixels of one object class type from the rest, using a one-vs-all approach. As we have 10 possible class types, we train 10 different algorithms returning the probability that a pixel belongs to a given class type.

Note: Other approaches might lead to better performances at the cost of higher code complexity and training time.

To train the algorithms on class type, we create a dataset made of 200K randomly chosen pixels belonging to the class type and of 200K pixels not belong to it. For each pixel, we collect the intensity of the 3+1+8+8 spectral bands. This dataset is converted into an H2O Frame and split into a training dataset (90%) used to train the GBM algorithm and a validation dataset (10%) that is used to evaluate the performance of the trained algorithm to identify object type of the pixels.

The training is distributed on a Spark cluster via Sparkling-Water. The GBM is made of 100 trees with a maximal depth of 15.

The performance of the model is obtained from the Area Under Curve (AUC) computed on the validation dataset. The AUC is at best equal to 1.0 and a model is generally considered satisfactory when it has an AUC above 0.8. The operation is repeated for each class type. The table below summarizes the model AUC of each object class type.

Object Type	Model AUC
Buildings	0.992441
Manmade	0.966026
Road	0.997235
Track	0.930540
Trees	0.960790
Crops	0.983181
Waterway	0.999718
Standing Water	0.999889
Large Vehicles	0.999354
Small Vehicles	0.997031

The trained models are saved in the form of MOJO files which we could easily import in Scala/Java when we’ll want to use them.

6) Pixel Clustering and Model Prediction

The previous section closed the model learning part of this analysis. In the next sections, we will use the trained model to identify the objects on raw pictures (for which we don’t have the polygons).

Note: for comparison purposes, we keep using the image labeled 6120_2_2 as the benchmark example.

Running the model prediction on the 11.5M pixels of the 450 images of the dataset is extremely time-consuming.

Hint: because many pixels on each picture have very similar spectral intensity, we can save a lot of computing time by clustering similar neighboring clusters together and compute the model prediction at the level of the pixel cluster.

We develop a simple algorithm to aggregate adjacent clusters which have similar spectral information (with a tolerance of 3%). The cluster color is taken as the color average of the constituting pixels. In a second stage, small clusters (<50 pixels) are merged with the surrounding cluster with the closest color.

Finally, the previously trained models are used to predict the probability that an entire cluster belong to each of the 10 possible object class types.

The result of this algorithm is shown below for the 6120_2_2 image. The first row, shows the full image, while the middle and bottom rows are zoomed in the bottom-left and on the top-left corner of the image respectively. The three columns show, from left to right:

The R-band image: the image brightness was increased to better appreciate the objects on the image. This results in some (harmless) color glitches in the overexposed area.
The identified clusters on the image. Where the cluster are randomly colored using a 256 gray levels palette. In other words, color has no particular meaning here.
The identified clusters on the image colored according to the most probable class type of the object belonging to the cluster. An eleventh color level is also present for clusters belonging to none of the object class types. On these pictures, we can see that the shape of the objects in the picture is quite visible.

7) Mask Predictions

From this information, we can create a raster mask per object class type indicating the presence of an object or not. Overlap of object class types is handled by a set of ad hoc rules based on other class types probability for this cluster and for the surrounding ones.

Example, we know that the chances to find a truck in the middle of a waterway area are null. Similarly, having a tree on top of a road is unlikely. On the contrary, finding a car or a truck on a road is quite probable, so as finding a tree in the middle of a crop.

These rules are tuned by hand based on what is observed in the training dataset. More sophisticated rules taking, for instance, into account the size of the cluster could also be added. Having a crop cluster made of a few pixel or a car of thousands of pixels are both quite unlikely. But we didn’t push the exercise that far for this blog.
We overlay in blue the masks that we obtained on the bottom-left corner of R-band image. These predicted masks are directly comparable with those shown in section 3). From this, we can see that we are doing quite well for identifying Building, Crops, and Trees. For Manmade structures, and vehicles we overpredict quite a bit. And our algorithms has a hard time to make difference between road and tracks in this dusty environment.

These issues could be solved by implementing more complex rules for the class overlaps (as discussed earlier) and/or by completing the one-vs-all models by some dedicated one-vs-one models which could be used to solve the ambiguities between road vs tracks, large vs small vehicles or standing water vs waterway. Again, we didn’t push the exercise that far for this introductory blog.

8) Mask Vectorizing

In this last step, we convert the object raster masks into a list of polygons. The Geotrellis Tile.toVector function allows doing this quite easily. The image grid_sizes are used in this process in order to translate pixel coordinates into vectorial coordinates on the grid.

We notice that for complex rasters with holes (i.e. the mask of the crops shown above), the function may have difficulties in identifying the underlying polygons. In this case, we split the raster in 4 quarters and try to vectorize each of the quarters separately. This procedure is applied iteratively until the sub-quarter rasters get to small or the vectorizer succeeds at identifying the polygons.

Geotrellis also provides high-level functions to manipulate/modify the polygons. We can, for instance, simplify the polygons to make them smoother and reduce their memory/disk footprint.

Finally, the polygons are saved on disk either in GeoJSON format or in WKT format.

Summary

We have shown how to combine the Spark, Geotrellis, H2O Sparkling-Water and the Spark-Notebook to perform scalable geospatial data analysis. We’ve taken an end-to-end benchmark example involving distributed Extract-Transform-Load (ETL) on GeoTIFF and GeoJSON data, Multi-spectral geospatial images analysis, a pixel based object tagger training, and more.

If you like this blog and are eager to see more details on this GIS benchmark, you should have a look at this example repository. Remember that all these notebooks are in read-only mode, so you can only see them. If you want to play with your own example you gonna have to take the 5 min needed to install the Spark-Notebook on your machine.

Have you already faced similar type of issues ? Feel free to contact us, we'd love talking to you…

If you enjoyed reading this post, please like it. It doesn't cost you anything, but matters for me!

Identifying new shop implantation thanks to geo-data analysis

loic.quertenmont@gmail.com (Loic Quertenmont) — Wed, 08 Feb 2017 07:05:21 +0000

In this blog, we will see how we can perform geospatial data analysis in order to identify new business opportunities. For this showcase, we will focus on the retail sector and more precisely on the supermarket leading brands in Belgium: Colruyt, Delhaize, Carrefour, and Lidl. We analyzed the location of supermarkets in Brussels, computed the average time travel to the closest supermarket for Brussels neighborhood and see how these four major brands are sharing their market zone among Brussels neighborhood accordingly. We are reusing the techniques detailed in the Dynamic Web scrapping blog post. The techniques described in this post can be useful for all sorts of B2C companies involved in the retail sector, where competition is generally strong and shop implantation matters.

In a first step, we collect data of supermarket implantation that we can find on Belgian shop indexers (pages d'Or, foursquare, etc.) web pages. To do so, we repeat the procedure described in the Dynamic Web scrapping blog post. We collect the data from the 19 towns of Brussels. In order to extend the coverage on the west part of Brussels, we also consider the town of Dilbeek. We then cleaned up the list of shops that we extracted in order to consider only those from Colruyt, Delhaize, Carrefour, and Lidl brands. Shops that are too close (<10m) from each other are ignored in order to avoid counting twice the same shop. This can happen if a shop is shared among two towns, if the shop brand or name recently changed or if the indexing isn’t perfectly up to date. After the cross-cleaning, we identified 109 supermarkets in the Brussels area.

We can then translate the shop addresses into a list of GPS coordinates using free services like the geopy geolocator which is doing a pretty good job.

[sourcecode language="python"] #with d a dictionary containing shop information (and in particular the address) loc = geolocator.geocode(d["address"]) d["coord"] = [loc.latitude, loc.longitude] #after this, the distance between two shops, d and d2 can be computed as: dist = great_circle(d["coord"], d2["coord"]).meters [/sourcecode] We can compute the average latitude and longitude of the 109 shop coordinates that we collected. The average coordinates are used to create a map centered on the region of interest. We use the ipyleaflet widget to create interactive maps into a Jupyter notebook. Now that we have a map, we can display all the identified shops and associated to each of these a color representing the shop brand.

We can already see that the various brands have quite different implantation strategies in Brussels. Some of them might be due to historical reasons. We can see that there are many Carrefour shops and that those are generally located in streets in the middle of a neighborhood. Those are likely relatively small shops with a local history. On the contrary, we can see that Colruyt and Lidl are generally located close to main roads and axes. Delhaize strategy seems to be somewhere in between. Given these orthogonal strategies, it is not easy to identify which brand is dominating in a specific area.

In order to better visualize the influence of each brand in Brussels, we can use a grid to split the Brussels map into 150 x 75 small cells. For each cell, we compute what is the average time to travel from that cell to all identified shops. We can then pick-up the closest shop in travel time. We can use the Open Source Routing Machine (OSRM) service to compute the travel time (by car) between two points. There is a public server with a RESTful API that can do such computations for us. Actually, we can do the computation of several points at the same time by providing up to ~100 coordinates through the GET request. The API response takes the form of a JSON file containing, among other things, a time distance matrix to go from the source coordinates to the destination coordinates. Thanks to this (free) API, the task is therefore quite simple.

We are then just left to draw the cells onto the map with a color indicating the brand of the closest shop.

From this colored map, we can see that Carrefour is dominating the map. Nonetheless, the other brands also cover significant areas and they do so with much fewer shops. So they might actually be winning the fight of investment versus coverage.

More interestingly, we can now have a look at the map showing the travel time to the closest shop. Red color indicates small travel time, White color indicates a travel time larger than 5 minutes. The color shade in-between indicates the travel time somewhere between 0 and 5 minutes. This map is very instructive because it allows realizing the importance of the main axes into the implantation strategy of some brands. It also allows spotting Brussels area where all supermarkets require a ride of more than 5 minutes. Those areas are good candidates for new shop implantations as they would almost guarantee to redirect the local population to the new shop brand.

This is the end of this benchmark demo, but they are several other things that we could do with such geospatial data analysis. Just to cite few of them: we could, for instance, correlate the information to the local census data. This would help identify the area where the local density of population is high and where there are no other shops nearby. We could also correlate these areas with real estate data in order to find good location/price commercial surface for renting. We could take into account the size of the supermarket into the neighborhood sharing map. We could correlate supermarket locations with nearby other types of shops like fuel stations, car store, DIY store, etc. We could build new maps with new shop implantation hypothesis and see how much we could steal business to the competitors. Etc… As usual, possibilities are endless.

The code used for this benchmark demo is available in a notebook format on NBviewer. Unfortunately, it doesn't handle map properly, but it gives you access to the code to reproduce my analysis. Feel free to play with it.

Have you already faced similar type of issues ? Feel free to contact us, we'd love talking to you…

If you enjoyed reading this post, please like it. It doesn't cost you anything, but matters for me!

Customer Analytics, Segmentation and Churn study from Facebook data

loic.quertenmont@gmail.com (Loic Quertenmont) — Thu, 19 Jan 2017 07:05:21 +0000

In this blog, we will see how we can perform in-depth customer analytics using publicly available inputs from the customers on company Facebook pages. For this showcase, we will focus on the media sector and more precisely on the RTL group (leading TV & Radio on the French speaking side of Belgium). We analyzed the behavior of people acting on the Facebook pages of the RTL group and aggregated all available information to perform per-user analytics and predictions. We are reusing the techniques detailed in previously published blog posts on Facebook Mining and Sentiment Analysis. The techniques described in this post can be very useful for all major B2C companies involved in the media, telecoms, retails sectors.For this benchmark example, we will focus on two TV channels (RTL-TVI and Plug RTL) and two radio channels (Bel RTL and Radio Contact). All these channels are owned by the RTL group and target a specific audience. For instance, plug RTL channel is targeting your adults (age from 15 to 34) while Bel RTL is more focused toward senior people.

In a first step, we collect data for Facebook pages of these four channels. To do so, we repeat the procedure described in the Facebook Mining blog post for each of the channels. We collected two months of data in the period ranging from the 22nd of September 2016 to the 22nd of November 2016. We quickly see that there have been many posts from RTL over that period with about 100 likes per posts. Plug-RTL being the exception with a limited number of posts and likes in that period.

[gallery ids="1120,1121,1122,1123" columns="2" size="large"]

Now that we have the data, we can identify who likes each individual Facebook posts and build a collection of Facebook users interacting with these four RTL pages. With just two months of data, we collected about 125 thousand individual Facebook users. Among which about 10% have been acting on at least two of the four pages. When you think that the large majority of Facebook users are using their name as Facebook pseudo, it means that we could actually collect about 125K real people names with just a few lines of code. This is impressive. If you are scared... stop reading here, because we are actually going to do much more than this...

At this point, we have enough information to start building a per-user dashboard (one per individual users that we have identified). To do so, we need to keep track of what actions each user made on every single post. This adds a little bit of complexity to our "user" database, but actually not that much...

Many user-centric metrics can then be built out of this user-action-post dataset. A simple set of metrics that we can use is user activity over time and channel, overall user activity per channel, passive (likes) vs active (messages) user actions per channel. Bellow, you can see the user-dashboard result for one random (anonymized) Facebook user ("Dany") out of our 125K known users.

We can also construct more complex metrics by, for instance, analyzing the type of posts the user likes or comments on. We classified the posts in various topic categories and checked on which topics the user is performing actions. We subdivided Facebook posts in 9 categories related to channel programs, channel presenters, miscellaneous, music related, politic related, television series, weather related, movie related and humor related. Then, we counted the number of messages of a given type a specific user liked for each channel, and use that to build the DNA profile of a specific customer. Obviously, categories can be customized depending on your business needs and type of questions you are willing to answer. On the figure bellow, extracted from Dany's dashboard, the numbers in parenthesis correspond to the number of posts that Dany liked among the total number of RTL posts in that category over the considered period. The color bars represent the ratio of this two numbers (ranging from 0 to 1).

Now, it should be clear, that we can perform user segmentation or user 360° analysis by comparing Dany's DNA profile to the other users we identified. The simplest method would be to perform a K-means clustering in view of grouping users similar interests.

We can also analyze the messages posted on Facebook by the customers. The sentimental analysis algorithm that we developed in a previous post is particularly interesting as it allows us to spot negative messages from unsatisfied customers. Identifying such messages allow us to engage communication with these unsatisfied customers before they change channel/brand (churn prevention) or before they spread their unsatisfactoriness with other customers. This could potentially have a significant impact on the image of your company and on your communication strategy. The faster we react on the Facebook page, the better we preserve the image of the company.

The figure bellow shows the 5 latest reactions from Dany to posts on RTL Facebook channel pages. We can see that overall, Dany clearly likes the programs on RTL channels. He is a fully satisfied customer.

In comparison, we can take a look at the similar metric for another user. Anne commented 71 times over a period of just two months. We see that she is a much less satisfied user and many of her messages are actually questions that would require an answer from RTL side. Most of these questions are very easy to answer and would help to turn Anne into a satisfied user. Note that in the dashboard, we require a confidence level on the sentence polarity of more than 75%. Messages bellow this threshold are marked as "unclear" polarity.

They are several other things that we can do with such cross-channels Facebook data analysis. We could, for instance, prevent further churn by comparing the activity of the customers on competitor (RTBF, vivacité, etc.) Facebook pages. We could enhance user segmentation by looking at pages of other companies: We can identify the cooking lovers by checking their activity on leader cooking pages, etc. We could perform the study over a much longer time period in order to identify trends and provide feedback/suggestions for a better communication with the customers. We could cross Facebook user database with other databases in order to find the mail or physical address of the customers and possibly engage communication, sending discounts, etc. Possibilities are basically infinite...

Have you already faced similar type of issues ? Feel free to contact us, we'd love talking to you…

If you enjoyed reading this post, please like it. It doesn't cost you anything, but matters for me!

Sentiment Analysis of French texts using deep learning techniques

loic.quertenmont@gmail.com (Loic Quertenmont) — Mon, 19 Dec 2016 07:05:20 +0000

In this blog, we will see how deep learning techniques (Recurrent Neural Network, RNN and/or Convolutional Neural Network, CNN) can be used to determine the sentiment polarity of a written text. This is call "sentiment analysis" and it's very useful to enhanced the communication with your customers. Such algorithms are typically used to analyze emails, website or even Facebook posts where your customers may talk about your products. Thanks to this, you can prioritize your answers and react faster to the unsatisfied customers...The web is full of blogs explaining how to perform polarity analysis on English texts. Unfortunately, most of these tutorials are not very useful to analyze text in other languages or contexts. These blogs either relies on:

Classical Natural Language Processing (NLP) libraries trained by others for the sentiment analysis, like NLTK, Polyglot or the Standard NLP. And those are often limited to a specific list of languages and/or quickly become inefficient for specific (i.e. business) text context.
Deep learning techniques using always the same IMDB dataset of movie reviews in English. Actually, this dataset can not even be used for commercial application (see the license on the IMDB website) and it can therefore not be used for business purposes. Moreover, depending on the type of text you are planning to analyze, it's certainly better to take a dataset which is closer to your business context. For instance, a movie reviews dataset won't be ideal if you own a restaurant/hotel and you'd like to know if your customers are satisfied by your services.

For these reasons, I decided to show how easy we can train a deep neural network to learn the sentiment behind a text in French. The technical aspects of this document are inspired by this excellent blog. Like there, I am using Keras with a Tensorflow backend. The main difference is coming from the training dataset used: they used the usual IMDB dataset, while I use a French dataset of movie reviews that I mined myself (see this previous blog post). The idea is to train our model to "read" a (movie review) text and predict what is the mark that is associated with this text (score of the movie). If we have enough pairs of review text - score to train the deep neural network, the algorithm will be capable of understanding which sequence of words have a positive meaning and which ones have a negative meaning. As there is no need to know the vocabulary and/or grammatical rules of the language in this learning process, it can be used for any language in the condition to have a large enough dataset to train the model.

Before feeding the neural network, we need to perform the traditional NLP pre-processing of the text (I don't enter into the details as this is very classical approach and it's already well documented on every single NLP blog):

Texts are chunked in words based on white space, punctuation, etc. This is done using a Treebank tokenizer of the NLTK library (nltk.tokenize.word_tokenize).
Characters are turned to lowercase, as I don't want the algorithm to identify the word polarity based on its capitalization. (Although it could help sometimes).
A dictionary of all the words used in the dataset is built. The dictionary includes all words variants like verb conjugations. Punctuation marks are also considered as words as I want the algorithm to learn the sentiments behind an entire sentence. Among the top 25 most words used, we have: ',', de, et, le, la, un, à, film, les, est, qui,. en, que, une, des, pas, du, ce, dans, !, pour, mais, a, on. Those are mostly French "stopwords" that are present in almost every French text with comparable frequencies. These words do not contain much information. Nonetheless, as I want the algorithm to be language generic I am not removing these words from the dictionary as it's generally done because identifying stopwords in a foreign language might be challenging. On the other hands, the less used words are often typos, names or very rare words that will add very little to the text analysis. I, therefore, keep only the 10.000 most frequent words in the dataset dictionary and ignore all the other ones. Considering that an average native English speaker knows about 30.000 words and a second-language English speaker knows about 10.000 words, the model is expected to be somehow limited, but not totally ridiculous.
At this point, the review texts are turned to mathematical vectors. Words of the text are replaced by their index in the dictionary. Words that aren't present in the dictionary are simply skipped. In order to ease the processing, I fix the size of the vectors to 500 words. Text with more words are trimmed to the first 500 words and text with fewer words are padded with 0. After this, every single review text is represented by a 500 integer vector with components between 0 and 10.000 (the size of the dictionary). This is the input data for the deep neural network that we are going to use.

In a first simple model, I used the five following layers for the neural network architecture:

An embedding layer which encodes the word integer (comprise between 0 and 10.000) as a float vector of 64 components. I could have used specific embedding algorithm like word2Vec or glove to perform this task, but I prefer to let the network figure out what is the best embedding for this particular dictionary, problem, and objective function. This layer takes as input a 500-vector of integer and returns a 500 x 64 float matrix.
The following layer is a 1D-convolution layer made of 32 filters each with a length of 3 items. The goal of this layer is to learn the meaning of consecutive word patterns that may have a particular sense. Thinking about sentences like: "I did NOT like this movie". I expect this layer to catch that "NOT" preceding a verb actually negates the sense of the verb. As I use filters with lengths of maximum 3 words we should be able to catch that type of features at least in English and French (Dutch might be a bit more tricky as the distance between verb and negation can actually be large). This layer takes as input a 500 x 64 matrix and returns a 500 x 32 filter matrix. A Rectified Linear (ReLU) activation is used for this layer.
Convolution layers ar traditionally followed by pulling layers which allows keeping the number of parameters in the model under control. In this case, we simply reduce the matrix dimension by a factor of 2 by keeping the strongest "word" in every two "words" window. Word is in a quote as it's not really a representation of the word anymore after it passed through the convolution layer. This layer, therefore, returns a matrix of dimension 250 x 32 which is twice smaller than it's input.
It's time for the recurrent layer to come in. Here I used an LSTM layer, but I could also have used a GRU layer. I used 256 memory units for this layer in order to catch a maximum number of features within the entire text. A much lower number of smart neurons would certainly have worked as well since the length of the text is at most 500 words. The output of the layer is, therefore, a vector of 256 components.
Finally, I used a dense layer which would reconnect all the features together and predict a single value between 0 and 1 corresponding to the text polarity. 0 would mean a rather negative text and 1 a rather positive text. I used a sigmoid activation for this layer as I want its output to be between 0 and 1.

The table on the right summarizes the model architecture.

The model is trained on 80.000 reviews (20.000 are used for the model validation) using an ADAM optimizer and a binary cross-entropy objective function. Only reviews with a polarized score <2/5 or >4/5 are used for this training and they are marked as 0 or 1 polarity.

The total number of free parameters in the model is around 1000.000 and the training type is approximately 30 min per epoch on my Intel i5 laptop. After three epoch the accuracy is about 94% (tested on the validation sample).

Results examples on other movie reviews:

"Magnifique . Drôle . Touchant . Cruel . Moderne . Humain. Beau . Inventif . Avec un superbe trio d'acteurs ." leads to a sentiment polarity of 0.999 (highly positive review).
"Les personnages sont des paumés névrosés avec leur vision complètement négative de la vie. Ce film se veut original et provocateur à travers la vulgarité et la méchanceté gratuites. A part les premières scènes acceptables, on ne regarde malheureusement la suite qu'en apnée. Déplorable !" leads to a sentiment polarity of 0.048 (highly negative review).

Results examples on other sentences:

"Je n'ai vraiment pas aimé ce film. Les acteurs sont mauvais et l'histoire est particulièrement nulle." 0.005 (negative sentiment).
"J'ai vraiment aimé ce film. Les acteurs sont excellents et l'histoire est originale." 0.998 (positive sentiment)
"Excellent" 0.942 (positive sentiment)
"Ca ne s'annonce pas bon" 0.461 (average sentiment)
"Vraiment étonnant" 0.854029 (rather positive sentiment)
"Un peu de gaieté et de plaisirs pour ce soir 😀" 0.445237 (rather negative sentiment).

We see that although it's not always perfect, the algorithm catches most of the time the sentiment behind a sentence/text. Moreover, when it fails the text polarity is close to 0.5 which indicates some confusion regarding the sentiment. But for such scores, it is not possible to know if the algorithm is confused or if the text is neutral.

We can actually improve a bit the model using multiple outputs and a softmax in the last layer. In other words, we can train a model that would give us the probability that a text is negative (category 1), neutral (category 2) or positive (category 3). With this type of output, we can also measure the confidence of the algorithm regarding its prediction.

This more advanced model is completely identical to the previous model (see figure) with the exception of the last layer which outputs a 3-float vector and uses a softmax activation (instead of a sigmoid) in order to guarantee a probability of being in one of the three categories. The objective function used for the training is this time a categorical cross-entropy.

Of course, this time we also included neutral reviews in the training phase which we write as a (0,1,0) score vector. (1,0,0) and (0,0,1) being used as score vector for the negative and neutral reviews, respectively. All the rest remains unchanged. The number of free parameters in the model is almost unchanged (still around 1.000.000) as the last layer only accounts for a tiny fraction of the model parameters. The training time is slightly larger than for the first model: ~40 min per epoch. But it is still quite reasonable. The model accuracy evaluated on the validation dataset is about 96%. The results for the same sentences as in the previous case are:

Results examples of other sentences:

"Je n'ai vraiment pas aimé ce film. Les acteurs sont mauvais et l'histoire est particulièrement nulle." [ 0.97207683 0.02485518 0.00306799](negative sentiment with 97% probability).
"J'ai vraiment aimé ce film. Les acteurs sont excellents et l'histoire est originale." [ 0.00260297 0.06905617 0.92834091] (positive sentiment with 93% probability)
"Excellent" [ 0.03898903 0.09006314 0.87094784] (positive sentiment with 87% probability)
"Ca ne s'annonce pas bon" [ 0.76711851 0.15689771 0.07598375] (negative sentiment with 76% probability)
"Vraiment étonnant" [0.05729499 0.11347752 0.82922751] (positive sentiment with 83% probability)
"Un peu de gaieté et de plaisirs pour ce soir 😀" [ 0.40064782 0.21832566 0.38102651] (mitigated sentiment).

We can see that the last sentence is clearly confusing the algorithm which isn't capable of identifying a clear sentiment behind this sentence. But this time, we can spot that all the algorithm predictions are under a 50% probability which is therefore rather weak.

The model can, of course, be further customized and improved, but I am stopping here for this tutorial. For the most curious one, here is the code I used to define and train the advanced model. As you can see, there is nothing really striking there.

from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.convolutional import Convolution1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
 
...
 # create the model
 embedding_vecor_length = 64
 model = Sequential()
 model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
 model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
 model.add(MaxPooling1D(pool_length=2))
 model.add(LSTM(256, dropout_W=0.2, dropout_U=0.2))
 model.add(Dense(3, activation='softmax'))
 model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
 print(model.summary())
 model.fit(Sequences[:split], Labels[:split], nb_epoch=3, batch_size=64)
 # Final evaluation of the model
 scores = model.evaluate(Sequences[split:], Labels[split:], verbose=0)
 print("Accuracy: %.2f%%" % (scores[1]*100))

Have you already faced similar type of issues ? Feel free to contact us, we'd love talking to you…

If you enjoyed reading this post, please like it. It doesn't cost you anything, but matters for me!

Scrapping social data from Facebook

loic.quertenmont@gmail.com (Loic Quertenmont) — Thu, 15 Dec 2016 07:05:19 +0000

Nowadays, social networks can be considered as a main source of data. This is particularly true for business to customer companies which must take into account customer feedback on their products. In this blog, we will show how to retrieve information from Facebook using the Facebook Graph API...

As a showcase, we will retrieve the activity over the last 6 months on the Facebook page of the Belgian Railway company (SNCB / NMBS). To do so, we will use the Facebook SDK package for python which provides handy functions to interact with the Graph API.

The first thing that we need is an identification token to connect to the Graph API. The Facebook token allows us to give specific permission to a given application. For instance, we can allow it to see our list of friends, our email address etc. Facebook is actually quite strict regarding the protection of user data. For what we will do today, we don't need any of these permissions as we are not trying to get information from a particular user but instead we want to access a public Facebook page and extract from it as much public information as possible. The token can be obtained via OAuth2 identification but in order to keep this blog simple, we will request a token using the Facebook Explorer interface. On this page, you need to press the "get token" button, and chose "get user token". A popup window opens where you can choose which permissions you want to associate to this token. We don't need any permission for what we plan to do, so no need to check any of the boxes. Be aware that the user token has a limited time validity of about 2 hours. So you may have to request a new token from time to time.

Now that we have a token, we can make a query on the Facebook API using the python SDK. The code bellow shows how to parse all posts on the Facebook page of the SNCB (Belgian railway). This is done via the "getConnections" function which takes as argument a Facebook object Id (here the Id of the SNCB page on Facebook) and the type of contents we want to grab (here the posts on the page). As the number of post on a page can be a pretty large number, Facebook graph API only returns the first posts. But, it also returns a pointer (via a "paging" object) to get the following bunch of posts, so if we want to process all posts of a page we will need to process them bunch by bunch until there is no following bunch available. See the graph API documentation for more details.

import facebook
#set token we received from Facebook explorer (the one bellow is out dated)
access_token = 'EAACEdEose0cBAPWTcaMppNFceRnsORWCFSiaaQD8Gr7UZArgl7xZBuucoUz96g3QmmP2tZCgR2DAlLl4sxmnmlabArULdZBGsqM7KcUHzlLZBJvRH6FWnVBfeYt7bAW5fZAWZCZALQnE0BhRQxraAUKW7ec0H6cwzL7GwkKGXZA435gZDZD'
public_page = 'sncb'
 
graph = facebook.GraphAPI(access_token)
pageFB = getObject(public_page)
posts = getConnections(pageFB['id'], connection_name='posts', summary='true')
 
while(True):
   for p in posts['data']:
      #we can process a facebook post  (for this example, we just print it)
      print(p)
 
   if( ('paging' not in posts) or ('next' not in posts['paging'])):
      #we have processed all posts, exit the while loop
      break
   else:
      #there are more post to grab, so get a new bunch of post
      posts = requests.get(posts['paging']['next']).json()

As you can see, the amount of information contained in one single post is quite impressive. You can note that the information is saved in the form of a python dictionary which is particularly convenient to access specific fields. See what I get for the first post:

story_tags : {'0': [{'type': 'page', 'id': '484217188294962', 'name': 'SNCB', 'offset': 0, 'length': 4}]}
from : {'id': '484217188294962', 'name': 'SNCB', 'category': 'Transportation Service', 'category_list': [{'id': '152367304818850', 'name': 'Transportation Service'}, {'id': '2258', 'name': 'Travel Company'}]}
link : https://www.facebook.com/SNCB/photos/a.844798168903527.1073741830.484217188294962/1157476480969026/?type=3
is_expired : False
updated_time : 2016-12-15T07:53:27+0000
actions : [{'name': 'Comment', 'link': 'https://www.facebook.com/484217188294962/posts/1157476480969026'}, {'name': 'Like', 'link': 'https://www.facebook.com/484217188294962/posts/1157476480969026'}]
icon : https://www.facebook.com/images/icons/photo.gif
is_hidden : False
message : "Les accompagnateurs de train Anneleen et Bart sont fiancés ! <3\nLeurs regards se sont croisés pour la première fois dans le centre de formation où leur carrière a débuté en 2011. Ce n’est qu’un an plus tard qu’a jailli l’étincelle, lors de leur premier rendez-vous. La gare de Louvain a récemment été le théâtre de la demande en mariage.\nNous leur souhaitons beaucoup de bonheur ensemble !"
object_id : 1157476480969026
shares : {'count': 75}
likes : {'paging': ... TRUNCATED ...}
privacy : {'friends': '', 'deny': '', 'allow': '', 'description': '', 'value': ''}
created_time : 2016-12-14T10:40:21+0000
type : photo
name : Timeline Photos
id : 484217188294962_1157476480969026
status_type : added_photos
story : SNCB feeling in love.
picture : https://scontent.xx.fbcdn.net/v/t1.0-0/p130x130/15492137_1157476480969026_4126379523377925813_n.jpg?oh=a99414cf126f04e12de4a006a435fc56&oe=58BA1CD4
comments : {'paging': .... TRUNCATED ... }

Among other things, we have the post message, creation time, the name of the author, type of post, a story description of the post, a link to the post picture and pointers to all the likes and comments (including who liked/commented on this post). That's a gigantic source of information for analytics. With this, we can identify who likes on this page (and is, therefore, concerned about Belgian Railway company), we can analyze the comments made on a post and possibly trigger unsatisfied customer message (and take action to improve the situation), we can do user segmentation based on the user profiles, etc.

As a simple benchmark for this post, we will analyze the ~150 posts that were published on the SNCB page of the last 6 months. We can look first at the post popularity in terms of likes, shares, and comments:

For the 15 most popular posts we show the post name. We can notice that for most of them, the post name is "Timeline Photos" which is a name generated by Facebook when a new picture is uploaded. We can therefore already notice that "picture" post are more popular than text or news post. This is precious information for the communication strategy of the company. We can continue by analyzing the text of the post in order to identify the topic of interest on SNCB page. To do so I used the simple entity extraction of the NLTK python library and plot them as a word cloud (word size depends on frequency):

As expected, Mobility, railway stations, and special exhibitions are the main center of interest. We don't learn anything very striking here, but this is just an example. What is more interesting to do is to analyze the text of the user reactions to SNCB posts (but we will not do it here). We can continue our simple page analytics by finding who are the major post likers for this page. As most of the Facebook users are actually using their real name as Facebook pseudo, this analysis is particularly interesting as it allows you to identify (most of the time) real people who are the defender of your products/brand. And eventually to identify the ones who have issues with your products. That's again very useful information as it gives you a chance to engage communication with them and solve the problem they may have and eventually prevent churn. Bellow are two figures showing the best fan of the SNCB page in terms of a #likes/user distribution and on the form of a word cloud.

We can perform many more analyses based on Facebook data, but I will stop here for this blog. Another one with more complex (and interesting) analytics will be released soon...

Have you already faced similar type of issues ? Feel free to contact us, we'd love talking to you…

If you enjoyed reading this post, please like it. It doesn't cost you anything, but matters for me!

Scrapping land invest data from dynamic web

loic.quertenmont@gmail.com (Loic Quertenmont) — Wed, 14 Dec 2016 07:05:18 +0000

In a previous blog post, we have seen how to mine information on static web pages. In this blog post, I'll explain how we can do the same on dynamically (i.e. javascript) generated web pages. As a showcase, I will show you how to find the best land investment you can make in Belgium today... We all know many websites of classified advertising for houses and lands selling. Those sites contain a large amount of interesting information for what concerns land invest. Unfortunately, most of the time, those websites are heavily using javascript to dynamically generate the web page based on the client information (i.e. language, type of web browser, screen size, geographical position, cached data, etc.). If we try to mine these sites using the techniques detailed in the previous blog, the only thing that we will get is actually a small part of the javascript code that is executed when the page is opened in a real browser. This makes the data harvesting particularly complicated.

Fortunately, they are useful python libraries that can be used to solve this problem. My favorite is Selenium which could be used to open a real web browser (i.e. Firefox or chrome) and automatize the used behavior like clicking on a link, filling a form, pressing a button, etc. Many selenium tutorials can be found on Guru99. The only thing that we have to do is to inspect the page (as explained in the previous blog) in order to identify the name of the elements of interest on the web page and tell selenium the sequence of action we want him to do on a page. Actually, our work is even more simplified by additional web browser plugins like Selenium IDE which could be integrated into your Firefox browser to record (and later export) all the actions you make on a web page. This allows to very quickly automatize repetitive behavior on the web.

Below is a small example of the selenium capabilities. In this demo, we simply open a Firefox web browser on the google page and search for the term "selenium".

import os,time
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
 
#initialize the web browser
os.environ["PATH"] += ":/PathTo/geckodriver" #needed for recent firefox versions
firefox_capabilities = DesiredCapabilities.FIREFOX
firefox_capabilities['marionette'] = True
driver = webdriver.Firefox(capabilities=firefox_capabilities)
driver.implicitly_wait(10) # timeout for page to load
 
#go to google.com
driver.get("http://www.google.com")
 
#find the searchbox and put "selenium" text in it
driver.find_element_by_id("lst-ib").send_keys("selenium")
 
#wait a little bit for the text to be sent
time.sleep(1)
 
#press the search button  (this will move us to the results page)
driver.find_element_by_id("lst-ib").send_keys(Keys.RETURN)
 
#convert the results page to a beautifulSoup object and print it
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.body.get_text(" "))
 
#wait 1min before closing the window (this is just for the demo)
time.sleep(60)
driver.quit()

So, this is it for the technical aspect of this blog post, we can now move to the logic of today's showcase. Again, I am not going to give specific code I used in order to preserve the server of the classified advertising website I've used. However, I can speak about the logic of the algorithm and about the results I've got. For collecting the data about the land selling market in Belgium, here are the typical actions we want to perform:

Search for land for building opportunities in a Belgian city (based on a zip code)
1. Find the search form on the page
2. Select "land for building" as type of good
3. Fill the zip code in the search box
4. Press the search button
Wait for the results page to load
Parse the result pages
1. Convert the page to a BeautifulSoup object (as in the previous blog post) and iterate on all the element of interest we want to gather
2. Search for a "next page" link and click the link if it exist
3. Go to step 3.1. and iterate until all results page even been downloaded
Go to step 1. and iterate with another zip code

As you can see the logic is rather simple and thanks to Selenium IDE, all these actions can be recorded in a couple of minutes. Then the only thing to do is to integrate all this in a python loop on city zip codes we want to analyze. I processed all post in all Belgian cities during the month of October and collected for each the description of the good, the surface of the land, the selling price, the zip code and the town. It took less than two hours to collect all the data, corresponding to approximately 10.000 posts. You can find bellow some figures made out of these data.

The number of posts collected per Belgian city zone (white indicate no post found).

Average surface (in m²) of the lands for buildings being sold in each city zone.

Average price (in euro) of the lands for buildings being sold in each city zone.

The average price per surface (€/m²) of the lands for buildings being sold in each city zone.

The last figure can be compared to the official figure made by the Belgian government in 2014. In can be seen that despite we used data from only one data harvesting in October 2016, the two figures are very similar and we can reproduce all the trends that are observed on the official figure: high price in Brussels and on the Belgian coast. The scale of the price is also quite comparable. We can also notice that the size of the lands being sold is much larger in Wallonia compare to Flanders, but the price/m² is also much lower.

Now that we have meaningful data, we can start looking for the best investment. To do so, we will look for the land which as a price/m² significantly lower than the average for its town. In order to accommodate with the limited statistics we have, we will only consider towns for which we have at least 5 offers (in order to have a reasonable error on the average). Bellow is the list of the 25 best investments you can make according to the average of the price in the town. In the top 10, nine goods are located in Flanders which is certainly meaning something...

Land of  1614 m² to sell at 215000 € (133.21 €/m²) in 1860 meise (average for the town is 352.58+- 66.56 €/m²)
Land of  1445 m² to sell at 125000 € ( 86.51 €/m²) in 3950 bocholt (average for the town is 152.01+- 20.83 €/m²)
Land of 15950 m² to sell at 595000 € ( 37.30 €/m²) in 2550 kontich (average for the town is 524.20+-155.83 €/m²)
Land of  3326 m² to sell at 144000 € ( 43.30 €/m²) in 3320 hoegaarden (average for the town is 232.71+- 61.52 €/m²)
Land of  2013 m² to sell at 107000 € ( 53.15 €/m²) in 3560 lummen (average for the town is 173.30+- 39.51 €/m²)
Land of  2487 m² to sell at 175000 € ( 70.37 €/m²) in 3520 zonhoven (average for the town is 179.41+- 36.29 €/m²)
Land of  1810 m² to sell at 135000 € ( 74.59 €/m²) in 3990 peer (average for the town is 177.42+- 34.83 €/m²)
Land of  2680 m² to sell at  90000 € ( 33.58 €/m²) in 3970 bourg-leopold (average for the town is 167.06+- 46.53 €/m²)
Land of  1386 m² to sell at 225000 € (162.34 €/m²) in 1860 meise (average for the town is 352.58+- 66.56 €/m²)
Land of  3050 m² to sell at  57000 € ( 18.69 €/m²) in 5350 ohey (average for the town is 45.65+- 9.70 €/m²)
Land of 10491 m² to sell at  32000 € (  3.05 €/m²) in 6640 vaux-sur-sure (average for the town is 42.92+- 14.34 €/m²)
Land of 14201 m² to sell at  90000 € (  6.34 €/m²) in 6860 leglise (average for the town is 56.48+- 18.47 €/m²)
Land of  7000 m² to sell at 165000 € ( 23.57 €/m²) in 1370 jodoigne-souveraine (average for the town is 85.28+- 23.38 €/m²)
Land of  2156 m² to sell at 312000 € (144.71 €/m²) in 2870 breendonk (average for the town is 269.68+- 47.35 €/m²)
Land of  1521 m² to sell at 128200 € ( 84.29 €/m²) in 3670 meeuwen-gruitrode (average for the town is 228.11+- 54.78 €/m²)
Land of  6858 m² to sell at 399000 € ( 58.18 €/m²) in 2310 rijkevorsel (average for the town is 250.75+- 74.73 €/m²)
Land of  6000 m² to sell at 280000 € ( 46.67 €/m²) in 1570 gammerages (average for the town is 221.27+- 68.20 €/m²)
Land of  1858 m² to sell at 275000 € (148.01 €/m²) in 1780 wemmel (average for the town is 423.63+-110.28 €/m²)
Land of 12370 m² to sell at 149000 € ( 12.05 €/m²) in 6470 sivry-rance (average for the town is 37.70+- 10.32 €/m²)
Land of  1371 m² to sell at 125000 € ( 91.17 €/m²) in 3990 peer (average for the town is 177.42+- 34.83 €/m²)
Land of 12000 m² to sell at 100000 € (  8.33 €/m²) in 5377 somme-leuze (average for the town is 46.75+- 15.63 €/m²)
Land of 13860 m² to sell at  35000 € (  2.53 €/m²) in 4190 ferrieres (average for the town is 45.60+- 17.78 €/m²)
Land of   770 m² to sell at  30000 € ( 38.96 €/m²) in 2235 hulshout (average for the town is 227.01+- 78.15 €/m²)
Land of  7700 m² to sell at  84000 € ( 10.91 €/m²) in 6670 gouvy (average for the town is 39.97+- 12.15 €/m²)

Have you already faced similar type of issues ? Feel free to contact us, we'd love talking to you…

If you enjoyed reading this post, please like it. It doesn't cost you anything, but matters for me!

Scrapping movie data from static web

loic.quertenmont@gmail.com (Loic Quertenmont) — Tue, 13 Dec 2016 07:05:18 +0000

Every data science journey starts by aggregating the data of interest. In the industry sector, those are often coming directly from sensors, user surveys, software or application used by your customers. Nonetheless, the information publicly available on the web still remain an important source of additional information like news, weather or even geographical addresses. Today, we will focus on movie data...

In this post, I will present some techniques to scrap static web pages in python 3. In this context, static mean pages for which the information is not dynamically generated by the web browser (using i.e. javascript functions) but are instead generated on the server-side and don't require your browser to dynamically generate information. Dynamic website scrapping will be addressed in a different post.

For this showcase, we will scrap a well-known french website of movie reviews. For all accessible movies, we will collect the movie director, the main actors, the movie title, and synopsis and some additional information like the release date, the duration of the movie, the type of movie (drama, comedy...). In addition, we will also collect user reviews (score and critics) for each movie. The name of the movie review website that is used is not mentioned on purpose in order to avoid massive load on their servers. Nevertheless, the techniques described here can be used on your favorite website.

In python 3, the easiest way to access the content of a web page is via the urllib.request package. See bellow a typical example to scrap the google page. Note that we provide the "User-Agent" Http header in the Request constructor because many websites only allow access to their content to well-known web browser ( Google Chrome in this case).

from urllib.request import urlopen, Request
url = "http://www.google.com"
try:
   req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'})
   pageHtml = urllib.request.urlopen(req, timeout=5)
   print(pageHtml.read()) #print the text of the page
   pageHtml.close()
except Exception as e:
   #handle for timeout, wrong URL, wrong permission, etc.
   print("Exception with url " + str(url)+" at line " + str(sys.exc_info()[-1].tb_lineno) + "\n" +str(e))

If you try to run the above code, you will see that what we receive is nothing but the Html code of the google page. That's all we need as the information we are looking for is hidden somewhere inside there. Many packages are available in python to help us navigate within the Html code and help us to collect the information we need. The most convenient (and famous) to use is BeautifulSoup. We can build a BeautifulSoup object from our web page using the following code:

from bs4 import BeautifulSoup
page = BeautifulSoup( pageHtml , "lxml")
print(page.body.get_text(" ", strip=True))  #print all the text found on the webpage

What is particularly interesting with BeautifulSoup is that we can easily search for all Html object on the page with a given class name. In modern web design, class name is very often used to categorize the different object displayed on a page. In the case that interest us, movie review website are often constructed as a table of movie where each row contains the information we are interested in. If we continue with the google example, we can easily retrieve the search text entry by inspecting the web page in a web browser (I used chrome). This is made easy by right clicking on the element for which we can to retrieve the class name and selecting "inspect" in the contextual menu. The code of the element is highlighted on the right inspection tab. The class name is generally one of the first entry of the block.

Retrieving the class name of an object on a page is very simple thanks to the "inspect" option of modern web browsers.[/caption]

We can then access this element from our code by searching for all elements which have a class name "gsfi" using BeautifulSoup find_all method. It returns a list of found object.

searchEntries = page.body.find_all(class_="gsfi", recursive=True)
if(len(searchEntries)==0):  print("Object was not found on page")

On more complex web page (i.e. movie review sites) we can then loop on all the object found in order to extract the information we are looking for using the get_text method.

for entry in searchEntries:
    print("Found an entry with text" + entry.get_text(" ", strip=True))

We now have all the ingredients to scrap any type of static web page. So it's probably a good time to remind that they are some good practice to follow when we scrap a website. Many information and details are available on scraphero, but for me, the more important advices are:

Don't go too fast: If you try to load too many pages at the same time, you have good chances to saturate the server and to be banned forever from the website.
Follow the rules provided by the robot.txt file of the website. This file indicates what is allowed in terms of scrapping on their website.

We can now proceed and collect our movie data...

As explained earlier, I am not going to give specific code in order to preserve the movie website server. However, I can show some of the information collected and some of the fun we can have with them. Bellows are three example of extracted movies out of the 70000 movies (~1GB of data) I collected.

Elysium: Synopsis: En 2154, il existe deux catégories de personnes : ceux très riches, qui vivent sur la parfaite station spatiale crée par les hommes appelée Elysium, et les autres, ceux qui vivent sur la Terre devenue surpeuplée et ruinée. Director: Neill Blomkamp Actors: Matt Damon , Jodie Foster , Sharlto Copley Extra data: 14 août 2013 / 1h 50min / Science fiction , Action , Thriller

Le Père Noël est une ordure: Synopsis: La permanence téléphonique parisienne SOS détresse-amitié est perturbée le soir de Noël par l'arrivée de personnages marginaux farfelus qui provoquent des catastrophes en chaîne. Director: Jean-Marie Poiré Actors: Anémone , Josiane Balasko , Marie-Anne Chazel Extra data: 25 août 1982 / 1h 23min / Comédie

Expendables 3: Synopsis: Barney, Christmas et le reste de l’équipe affrontent Conrad Stonebanks, qui fut autrefois le fondateur des Expendables avec Barney. Stonebanks devint par la suite un redoutable trafiquant d’armes, que Barney fut obligé d’abattre… Du moins, c’est ce qu’il croyait. Director: Patrick Hughes (II) Actors: Sylvester Stallone , Jason Statham , Arnold Schwarzenegger Extra data: 20 août 2014 / 2h 07min / Action

As you can see the movie details are quite complete. I have not listed there the user reviews associated with these movies are they are many reviews for each move and that those are often very lengthy. Nonetheless, we have the data and can use them if needed.

We can now perform some analysis on these data in order to find insights,

Who are the directors who have made more movie?

John Ford : 84 movies
Raoul Ruiz : 66 movies
Kenji Mizoguchi : 63 movies
Henry Hathaway : 57 movies
Jesús Franco : 56 movies
Julien Duvivier : 54 movies
Claude Chabrol : 53 movies
Jean-Pierre Mocky : 52 movies
Seijun Suzuki : 52 movies
Raoul Walsh : 51 movies

Who are the actors who have played in more movie ?

Gérard Depardieu : 116 movies
Michel Piccoli : 107 movies
Catherine Deneuve : 102 movies
Robert De Niro : 90 movies
Bernard Blier : 89 movies
Jean Gabin : 88 movies
Michel Serrault : 86 movies
Fernandel : 84 movies
Jean-Louis Trintignant : 83 movies
Jeanne Moreau : 82 movies

Of course, we can see that french actors and directors appear in this top 10. This might be a bias induced by the "french" origin of the movie review site that we have scrapped. There is also a "temporal" bias as we considered all movies without time window. Making similar top 10 considering only movies produced in the last decade would lead to very different results.

What are all the movies where Tom Hanks played ? (in random order)

The Circle (de James Ponsoldt)
Les Monstres du labyrinthe (de Steven Hilliard Stern)
Extrêmement fort et incroyablement près (de Stephen Daldry)
Mister Showman (de Sean McGinly)
Greyhound (de Aaron Schneider)
Dans l'ombre de Mary - La promesse de Walt Disney (de John Lee Hancock)
Il n'est jamais trop tard (de Tom Hanks)
Turner & Hooch (de Roger Spottiswoode)
Les Sentiers de la perdition (de Sam Mendes)
Le Palace en folie (de Neal Israel)
Misery Loves Comedy (de Kevin Pollak)
Nuits blanches à Seattle (de Nora Ephron)
Vous avez un message (de Nora Ephron)
Capitaine Phillips (de Paul Greengrass)
Every time we say goodbye (de Moshe Mizrahi)
Les Banlieusards (de Joe Dante)
Philadelphia (de Jonathan Demme)
Toujours prêts (de Nicholas Meyer)
Cloud Atlas (de Lilly Wachowski)
La Guerre (de Lynn Novick)
Sully (de Clint Eastwood)
Joe contre le volcan (de John Patrick Shanley)
Une Équipe hors du commun (de Penny Marshall)
Big (de Penny Marshall)
La Guerre selon Charlie Wilson (de Mike Nichols)
Le Mot de la fin (de David Seltzer)
Rien en commun (de Garry Marshall)
And the Oscar goes to (de Jeffrey Friedman)
Splash (de Ron Howard)
Inferno (de Ron Howard)
Anges et démons (de Ron Howard)
Da Vinci Code (de Ron Howard)
Apollo 13 (de Ron Howard)
Dragnet (de Tom Mankiewicz)
Le Bûcher des vanités (de Brian De Palma)
Ithaca (de Meg Ryan)
Il faut sauver le soldat Ryan (de Steven Spielberg)
Arrête-moi si tu peux (de Steven Spielberg)
Le Pont des Espions (de Steven Spielberg)
Le Terminal (de Steven Spielberg)
L'Homme à la chaussure rouge (de Stan Dragoti)
Le Pôle Express (de Robert Zemeckis)
Seul au monde (de Robert Zemeckis)
Forrest Gump (de Robert Zemeckis)
La Ligne verte (de Frank Darabont)
A Hologram for the King (de Tom Tykwer)
Une Baraque à tout casser (de Richard Benjamin)
Ladykillers (de Ethan Coen)

Who are the favorite actors of Steven Spielberg, Ron Howard and Robert Zemeckis ? and how many films do they have in common?

The figure bellow shows a graph network of the actor shared between these three directors. For clarity of the figures, only the actors that are in the top 500 actors are shown. Green dots symbolizes actors, red dots symbolizes directors and each line correspond to a movie which connects a director and an actor. When more than one line connects a director to an actor, it means that there are several movies connecting them.

Have you already faced similar type of issues ? Feel free to contact us, we'd love talking to you…

If you enjoyed reading this post, please like it. It doesn't cost you anything, but matters for me!