Liip Blog https://www.liip.ch/en/blog Kirby Wed, 25 Apr 2018 00:00:00 +0200 Latest articles from the Liip Blog en Tensorflow and TFlearn or can deep learning predict if DiCaprio could have survived the Titanic? http://feeds.liip.ch/~r/liip_blog/~3/wMLz3H-a1CI/tensorflow-and-tflearn-or-can-deep-learning-predict-if-dicaprio-could-have-survived-the-titanic https://www.liip.ch/en/blog/tensorflow-and-tflearn-or-can-deep-learning-predict-if-dicaprio-could-have-survived-the-titanic Wed, 25 Apr 2018 00:00:00 +0200 <p>Getting your foot into deep learning might feel weird, since there is so much going on at the same time. </p> <ul> <li>First, there are myriads of frameworks like <a href="http://tensorflow.org">tensorflow</a>, <a href="https://caffe2.ai">caffe2</a>, <a href="http://torch.ch">torch</a>, <a href="http://www.deeplearning.net/software/theano/">theano</a> and <a href="https://github.com/Microsoft/cntk">Microsoft's open source deep learning toolkit CNTK</a>. </li> <li>Second, there are dozens of different ideas about how networks can be put to work, e.g. <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">recurrent neural networks</a>, <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">long short-term memory networks</a>, <a href="https://de.wikipedia.org/wiki/Generative_Adversarial_Networks">generative adversarial networks</a> and <a href="https://de.wikipedia.org/wiki/Convolutional_Neural_Network">convolutional neural networks</a>. </li> <li>And finally there are even more frameworks on top of these frameworks, such as <a href="https://keras.io">keras</a> and <a href="http://tflearn.org">tflearn</a>. </li> </ul> <p>In this blogpost I'll just take the two subjectively most popular choices, Tensorflow and Tflearn, and show you how they work together. We won't put much emphasis on the network layout; it's going to be a plain vanilla fully connected network with two hidden layers. </p> <p>Tensorflow is the low-level library for deep learning. If you want you could just use this library, but then you would need to write way more boilerplate code. Since I am not such a big fan of boilerplate (hi Java) we are not going to do this. Instead we will use Tflearn. Tflearn started out as a standalone open source library that provides an abstraction on top of Tensorflow. Last year Google integrated the project tightly with Tensorflow to make the learning curve less steep and the handling more convenient. </p> <p>I will use both to predict the survival rate in a commonly known data set called the <a href="https://www.kaggle.com/c/titanic">Titanic Dataset</a>. Beyond this dataset there are of course <a href="http://scikit-learn.org/stable/datasets/index.html">myriads</a> of such classical sets, of which the most popular is the <a href="https://www.kaggle.com/uciml/iris">Iris dataset</a>. It is handy to know these datasets, since a lot of tutorials are built around them, so when you are trying to figure out how something works you can google for the dataset together with the method you are trying to apply. A bit like the hello world of programming. Finally, since these sets are well studied, you can try the methods shown in blog posts on other datasets and compare your results with others. But let’s focus on our Titanic dataset first.</p> <h3>Goal: Predict survivors on the Titanic</h3> <p>Being the most famous shipwreck in history, the Titanic sank after colliding with an iceberg on 15 April 1912.
Of the 2,224 passengers and crew aboard, about 1,502 died, because there were not enough lifeboats for everyone. </p> <img src="https://liip.rokka.io/www_inarticle/781d2e/titanic-infographic-11.jpg" alt="Titanic"> <p>Now we could say it was sheer luck who survived and who perished, but we could also be a bit more provocative and say that some groups of people were more likely to survive than others, such as women, children or ... the upper class. </p> <p>Now making such crazy assumptions about the upper class is not worth a dime if we cannot back them up with data. In our case, instead of doing boring descriptive statistics, we will train a machine learning model with Tensorflow and Tflearn that will predict survival rates for Leo DiCaprio and Kate Winslet for us. </p> <h3>Step 0: Prerequisites</h3> <p>To follow along in this tutorial you will obviously need the Titanic data set (which can be automatically downloaded by Tflearn) and both a working Tensorflow and Tflearn installation. <a href="http://tflearn.org/installation/">Here</a> is a good tutorial on how to install both. Here is a quick recipe on how to install both on a Mac, although it will surely be out of date soon (new versions etc.):</p> <pre><code class="language-bash">sudo pip3 install https://ci.tensorflow.org/view/tf-nightly/job/tf-nightly-mac/TF_BUILD_IS_OPT\=OPT,TF_BUILD_IS_PIP\=PIP,TF_BUILD_PYTHON_VERSION\=PYTHON3,label\=mac-slave/lastSuccessfulBuild/artifact/pip_test/whl/tf_nightly-1.head-py3-none-any.whl</code></pre> <p>If you happen to have a MacBook with an NVIDIA graphics card, you can also install Tensorflow with GPU support (your computations will then run in parallel on the graphics card's GPU, which is much faster). Before attempting this, please check your graphics card under &quot;About This Mac&quot; first. The chances that your MacBook has one are slim.</p> <pre><code class="language-bash">sudo pip3 install https://storage.googleapis.com/tensorflow/mac/gpu/tensorflow_gpu-1.1.0-py2-none-any.whl</code></pre> <p>Finally install TFlearn - in this case the bleeding edge version:</p> <pre><code class="language-bash">sudo pip3 install git+https://github.com/tflearn/tflearn.git</code></pre> <p>If you are having problems with the install, <a href="https://www.tensorflow.org/install/install_mac#CommonInstallationProblems">here</a> is a good troubleshooting page to sort you out. </p> <p>To get started in an IPython notebook or a Python file we need to load all the necessary libraries first. We will use numpy and pandas to make our life a bit easier and sklearn to split our dataset into a train and a test set. Finally we will also obviously need tflearn and the datasets. </p> <pre><code class="language-python"># import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import tflearn
from tflearn.data_utils import load_csv
from tflearn.datasets import titanic</code></pre> <h3>Step 1: Load the data</h3> <p>The Titanic dataset is stored in a CSV file. Since this toy dataset comes with TFLearn, we can use the TFLearn load_csv() function to load the data from the CSV file into a Python list. By specifying the 'target_column' we indicate that the labels (the thing we are trying to predict: survived or not) are located in the first column. We then store our data in a pandas dataframe to inspect it more easily (e.g. df.head()), and then split it into a train and a test dataset.
</p> <pre><code class="language-python"># Download the Titanic dataset
titanic.download_dataset('titanic_dataset.csv')
# Load CSV file, indicate that the first column represents labels
data, labels = load_csv('titanic_dataset.csv', target_column=0, categorical_labels=True, n_classes=2)
# Make a df out of it for convenience
df = pd.DataFrame(data)
# Do a test / train split
X_train, X_test, y_train, y_test = train_test_split(df, labels, test_size=0.33, random_state=42)
X_train.head()</code></pre> <img src="https://liip.rokka.io/www_inarticle/703967/dataset.png" alt="dataframe" format="png"> <p>Studying the data frame you also see that we have a few pieces of information for each passenger. In this case I took a look at the first entry:</p> <ul> <li>passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)</li> <li>name (e.g. Allen, Miss. Elisabeth Walton)</li> <li>gender (e.g. female/male)</li> <li>age (e.g. 29)</li> <li>number of siblings/spouses aboard (e.g. 0)</li> <li>number of parents/children aboard (e.g. 0)</li> <li>ticket number (e.g. 24160) and</li> <li>passenger fare (e.g. 211.3375)</li> </ul> <h3>Step 2: Transform</h3> <p>Since the ticket number is a string, we could encode it as a category. But since we don’t know which ticket numbers Leo and Kate had, let’s just remove it as a feature. Similarly, the name of a passenger as a plain string is not going to be relevant either without further preprocessing. To keep things short in this tutorial, we are simply going to remove both columns. We also want to dichotomize or <a href="http://pbpython.com/categorical-encoding.html">label-encode</a> the gender for each passenger, mapping male to 1 and female to 0. Finally we want to transform the data frame back into a numpy float32 array, because that's what our network expects. To achieve this I wrote a small function that performs these steps on a pandas dataframe:</p> <pre><code class="language-python"># Transform the data
def preprocess(r):
    r = r.drop([1, 6], axis=1, errors='ignore')
    r[2] = r[2].astype('category')
    r[2] = r[2].cat.codes
    for column in r.columns:
        r[column] = r[column].astype(np.float32)
    return r.values

X_train = preprocess(X_train)
pd.DataFrame(X_train).head()</code></pre> <p>We see that after the transformation the gender in the data frame is encoded as zeros and ones. </p> <img src="https://liip.rokka.io/www_inarticle/738ba2/transformed.png" alt="transformed" format="png"> <h3>Step 3: Build the network</h3> <p>Now we can finally build our deep learning network which is going to learn the data. First of all, we specify the shape of our input data. Each input sample has a total of 6 features, and we will process samples in batches to save memory. The None parameter means an unknown dimension, so we can change the total number of samples that are processed in a batch. So our data input shape is [None, 6]. Finally, we build a three-layer neural network with this simple sequence of statements. </p> <pre><code class="language-python">net = tflearn.input_data(shape=[None, 6])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)</code></pre>
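<p>The last fully connected layer uses a softmax activation, which turns the two raw output scores into probabilities for the two classes (died / survived); more on activation functions in a moment. As a small illustration of what softmax computes, here is a plain numpy sketch (it is not part of the TFLearn code itself):</p> <pre><code class="language-python"># Minimal sketch: what the softmax activation does to two raw output scores
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability, then normalize to probabilities
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

# e.g. raw network outputs for [died, survived]
print(softmax(np.array([2.0, 0.5])))  # -> roughly [0.82, 0.18]</code></pre>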
<p>If you want to visualize this network you can use Tensorboard, although there will not be much to see (see below): Tensorflow won't draw all the nodes and edges but rather abstracts whole layers into one box. To have a look at it you need to start Tensorboard in your console, and it will then become available at <a href="http://localhost:6006">http://localhost:6006</a>. Make sure to use a Chrome browser when you are looking at the graphs; Safari crashed for me. </p> <img src="https://liip.rokka.io/www_inarticle/6e070b/tensorboard.png" alt="tensorboard" format="png"> <pre><code class="language-bash">sudo tensorboard --logdir=/tmp</code></pre> <p>What we basically have are 6 nodes, which are our inputs. These inputs are connected to 32 nodes, which are all fully connected to another 32 nodes, which are then connected to our 2 output nodes: one for survival, the other for death. The activation function <a href="https://en.wikipedia.org/wiki/Softmax_function">softmax</a> is the way to define when a node &quot;fires&quot;. It is one option among others like <a href="https://de.wikipedia.org/wiki/Sigmoidfunktion">sigmoid</a> or <a href="https://en.wikipedia.org/wiki/Rectifier">relu</a>. Below you see a schematic I drew with graphviz based on a dot file, which you can download <a href="https://github.com/plotti/titanic_with_tflearn">here</a>. Instead of 32 nodes in the hidden layers I only drew 8, but you hopefully get the idea. </p> <img src="https://liip.rokka.io/www_inarticle/265e2c/graph-txt.png" alt="Graph" format="png"> <h3>Step 4: Train it</h3> <p>TFLearn provides a wrapper called Deep Neural Network (DNN) that automatically performs neural network classifier tasks, such as training, prediction, save/restore, and more. I think this is pretty handy. We will run it for 20 epochs, which means that the network will see all the data 20 times, with a batch size of 32, which means that it will take in 32 samples at once. We will train one model without a <a href="https://en.wikipedia.org/wiki/Cross-validation">validation</a> split and one with a 10% validation split to see whether it makes a difference. </p> <pre><code class="language-python"># Define model
model = tflearn.DNN(net)
# Start training (apply gradient descent algorithm)
model.fit(X_train, y_train, n_epoch=20, batch_size=32, show_metric=True)

# With a validation split if you want
model2 = tflearn.DNN(net)
model2.fit(X_train, y_train, n_epoch=10, batch_size=16, show_metric=True, validation_set=0.1)</code></pre> <h3>Step 5: Evaluate it</h3> <p>Well, finally we've got our model and can now see how well it really performs. This is easy to do with:</p> <pre><code class="language-python"># Evaluation
X_test = preprocess(X_test)
metric_train = model.evaluate(X_train, y_train)
metric_test = model.evaluate(X_test, y_test)
metric_train_1 = model2.evaluate(X_train, y_train)
metric_test_1 = model2.evaluate(X_test, y_test)
print('Model 1 Accuracy on train set: %.9f' % metric_train[0])
print("Model 1 Accuracy on test set: %.9f" % metric_test[0])
print('Model 2 Accuracy on train set: %.9f' % metric_train_1[0])
print("Model 2 Accuracy on test set: %.9f" % metric_test_1[0])</code></pre> <p>The output gave me very similar results for the train set (0.78) and the test set (0.77) for both models. So for this small example the validation split does not really seem to make a difference. Both models do a fairly good job at predicting the survival rate of the Titanic passengers. </p> <h3>Step 6: Use it to predict</h3> <p>We can finally see what Leonardo DiCaprio's (Jack) and Kate Winslet's (Rose) survival chances really were when they boarded that ship. To do so I modeled both by their attributes.
So for example Jack boarded third class (today we would call it economy class), was male, 19 years old, had no siblings or parents on board and paid only $5 for his passenger fare. Rose of course traveled first class, was female, 17 years old, had a sibling and two parents on board and paid $100 for her ticket. </p> <pre><code class="language-python"># Let's create some data for DiCaprio and Winslet
dicaprio = [3, 'Jack Dawson', 'male', 19, 0, 0, 'N/A', 5.0000]
winslet = [1, 'Rose DeWitt Bukater', 'female', 17, 1, 2, 'N/A', 100.0000]
# Preprocess data with the same function we used for the training set
passengers = preprocess(pd.DataFrame([dicaprio, winslet]))
# Predict surviving chances (class 1 results)
pred = model.predict(passengers)
print("DiCaprio Surviving Rate:", pred[0][1])
print("Winslet Surviving Rate:", pred[1][1])</code></pre> <p>The output gives us:</p> <ul> <li><strong>DiCaprio Surviving Rate: 0.128768</strong></li> <li><strong>Winslet Surviving Rate: 0.903721</strong></li> </ul> <img src="https://liip.rokka.io/www_inarticle/31dc67/sinking.jpg" alt="Rigged game"> <h3>Conclusion, what's next?</h3> <p>So now we know it was a rigged game after all. Given his background, DiCaprio really had low chances of surviving this disaster. While we didn't really learn anything new about the outcome of the movie, I hope that you enjoyed this quick intro to Tensorflow and Tflearn, which are not really hard to get into and don't have to end in disaster. </p> <p>In our example there was really no need to pull out the big guns; a simple regression or any other machine learning method would have worked fine. Tensorflow, TFlearn or Keras really shine though when it comes to image, text and audio recognition tasks. With the very popular Keras library we are able to reduce the boilerplate for these tasks even further, which I will cover in one of the future blog posts. In the meantime I encourage you to play around with neural networks in your browser with these <a href="https://playground.tensorflow.org">excellent</a> <a href="https://teachablemachine.withgoogle.com">examples</a>. I am looking forward to your comments and hope that you enjoyed this fun little blog post. If you want you can download the IPython notebook for this example <a href="https://github.com/plotti/titanic_with_tflearn/blob/master/Titanic%20example.ipynb">here</a>.</p><img src="http://feeds.feedburner.com/~r/liip_blog/~4/wMLz3H-a1CI" height="1" width="1" alt=""/> https://www.liip.ch/en/blog/tensorflow-and-tflearn-or-can-deep-learning-predict-if-dicaprio-could-have-survived-the-titanic Neo4j graph database and GraphQL: A perfect match http://feeds.liip.ch/~r/liip_blog/~3/6jt2Q_jrFog/neo4j-and-graphql-a-perfect-match https://www.liip.ch/en/blog/neo4j-and-graphql-a-perfect-match Tue, 24 Apr 2018 00:00:00 +0200 <p>As a graph database enthusiast I often got asked the question “So, as one of our graph experts, can you answer this GraphQL question?”. </p> <p>Every time I have said that a graph database is not the same as GraphQL, and that I was unable to answer the question. </p> <p>Another colleague of mine, Xavier, had the same experience. So we decided to learn about GraphQL so that we could answer the questions, and because it seemed like a really cool way to write APIs.
</p> <p>So before we dig into this any further, a quick overview of what a graph database is and what GraphQL is: </p> <ul> <li>A graph database is a database that stores data as a graph, which makes it super performant for data with many relations to other data, and that describes a lot of our modern, complex data sets.</li> <li>GraphQL is a way to query data: a flexible API protocol that lets you ask for just the data that you need and get exactly that in return. Any storage can sit behind this data; GraphQL is <em>not</em> responsible for storage. </li> </ul> <p>Emanuele and I spent the innoday together, where we decided to try out Neo4j with the GraphQL extension. </p> <p>Emanuele had never tried Neo4j before, and was amazed at how easy it is to get going. GraphQL was also super easy to install. Basically what you do is: </p> <ol> <li>Download Neo4j Desktop</li> <li>Go through the very obvious setup</li> <li>Click “install” on the GraphQL plugin, which is already listed</li> <li>Use “browser” to populate data</li> <li>Query the data with GraphQL, which, thanks to the plugin, automatically has a schema for you</li> </ol> <p>With this technology we aim to solve a problem that we currently have. The problem is one of our REST APIs, which has a LOT of data. When consumers query for something that they need, they also get a lot of things that they don’t need but that another consumer may need. We end up wasting bandwidth.</p> <p>How does GraphQL solve this? And where does Neo4j come in?<br /> In GraphQL we query only for the data that we need, making a POST request. A request may look like: </p> <pre><code>{
  Product {
    title
    product_id
  }
}</code></pre> <p>If we have a graph database with a single node with the label “Product” and the properties “title” and “product_id” we would get this behaviour out of the box with the GraphQL plugin. Tada! A simple API is created; all you had to do was add the data! DONE. </p> <p>This makes it very easy for consumers to query JUST for the data that they need. They can also see the entire schema, where we can write descriptions in the form of comments, so that the consumer knows what to do with it.</p> <p>How does Neo4j work, and how do you add data? It is super simple! With Neo4j Desktop, you click “manage” on your database and then “Open Browser”; in the browser you can explore your data and execute queries in a graphical way.</p> <img src="https://liip.rokka.io/www_inarticle/69fb93/desktop.png" alt="desktop" format="png"> <img src="https://liip.rokka.io/www_inarticle/bd619a/browser.png" alt="browser" format="png"> <p>You can use the browser to click through and start learning; it is also a great tool for experimenting, so for learning Neo4j I leave you with the browser to get your own hands-on experience. Have fun playing with it! They explain it really well.</p> <img src="https://liip.rokka.io/www_inarticle/ea5f3c/browser-learning.png" alt="browser-learning" format="png"> <p>Ok, so simple graphs are very easy to query for, but how do we query for relationships with GraphQL and Neo4j? Let’s say we have a node labeled “Product” again, but this time the “title” is not stored directly on the product; it lives in a node labeled “TranslatedProductProperties”. This node holds all the translated properties that the product may need, such as the title. The relation is called “in_language”, and the relation itself has a property called “language”.</p>
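<p>If you prefer scripting over clicking in the browser, here is a minimal sketch of how such data could be created with the official Neo4j Python driver. The connection URI, credentials, property values and the relationship direction are just placeholder assumptions for illustration; the GraphQL plugin itself does not require any of this:</p> <pre><code class="language-python"># Minimal sketch: create the Product / TranslatedProductProperties example data
# (the bolt URI and credentials are placeholders for a local Neo4j instance)
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # one product, one node with its German properties, linked via in_language
    session.run(
        "CREATE (p:Product {product_id: $pid}) "
        "CREATE (t:TranslatedProductProperties {title: $title}) "
        "CREATE (p)-[:in_language {language: $lang}]->(t)",
        pid="4711", title="Gummistiefel", lang="de",
    )

driver.close()</code></pre>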
<p>This is how the actual graph looks in the browser:</p> <img src="https://liip.rokka.io/www_inarticle/29d08c/graph.png" alt="graph" format="png"> <p>In GraphQL we would need a custom IDL (schema) for that to be queried nicely. Our goal would be to query it the following way: we simply ask for the title and the language, and as GraphQL consumers we don’t need to know about the actual relations: </p> <pre><code>{
  Product {
    product_id
    title(language: "de")
  }
}</code></pre> <p>A simple IDL for that could look like: </p> <pre><code>type Product {
  product_id: String
  # The title in your provided language =D
  title(language: String = "de"): String @cypher(statement: "MATCH(this)-[:in_language {language:$language}]-(pp:ProductProperties) return pp.title")
}</code></pre> <p>So here you can see that, with the help of the GraphQL plugin, we can use a Cypher query to get whatever data we want for that field. Cypher is the query language of Neo4j.</p> <p>Being able to get any nodes and relationships, which are already decoupled unless you choose to couple them, combined with this querying flexibility, makes Neo4j + GraphQL an API as flexible as... rubber? An awesome match!</p><img src="http://feeds.feedburner.com/~r/liip_blog/~4/6jt2Q_jrFog" height="1" width="1" alt=""/> https://www.liip.ch/en/blog/neo4j-and-graphql-a-perfect-match Make open data discoverable for search engines http://feeds.liip.ch/~r/liip_blog/~3/NU2p1k-v5EM/make-open-data-discoverable-for-search-engines https://www.liip.ch/en/blog/make-open-data-discoverable-for-search-engines Tue, 24 Apr 2018 00:00:00 +0200 <p>Open data portals are a great way to discover datasets and present them to the public. But they lack interoperability, and it is thus even harder to search across them. Imagine if the dataset you are looking for were just a simple “Google search” away. Historically there are <a href="http://rs.tdwg.org/dwc/index.htm">lots</a> <a href="http://www.dcc.ac.uk/resources/metadata-standards/ddi-data-documentation-initiative">and</a> <a href="https://www.w3.org/TR/vocab-data-cube/">lots</a> <a href="https://frictionlessdata.io/specs/data-packages/">of</a> <a href="https://www.iso.org/standard/26020.html">metadata</a> <a href="https://www.loc.gov/marc/">standards</a>. CKAN, the de-facto standard portal software, uses a model that is close to <a href="http://dublincore.org/specifications/">Dublin Core</a>. It consists of 15 basic fields to describe a dataset and its related resources.</p> <p>In the area of Open Government Data (OGD) the widely used metadata standard is <a href="https://www.w3.org/TR/vocab-dcat/">DCAT</a>, especially its application profiles (“DCAT-AP”), which are specializations of the DCAT standard for certain topic areas or countries. For CKAN the <a href="https://github.com/ckan/ckanext-dcat">ckanext-dcat</a> extension provides plugins to expose and consume DCAT-compatible data using an RDF graph. We use this extension on <a href="https://opendata.swiss/">opendata.swiss</a> and <a href="https://data.stadt-zuerich.ch">data.stadt-zuerich.ch</a>, as it provides handy interfaces to extend it to our custom data model. I’m a regular code contributor to the extension.</p> <p>When Dan Brickley, who works for Google, <a href="https://github.com/ckan/ckanext-dcat/issues/75">opened an issue on the DCAT extension</a> about implementing schema.org/Dataset for CKAN, I was very excited. I only learned about it in December 2017 and thought it would be a fun feature to implement over the holidays.
But what exactly was Dan suggesting?</p> <p>With ckanext-dcat we already have the bridge from our relational (“database”) model to a graph (“linked data”). This is a huge step that enables new uses of our data. Remember the 5 star model of Sir Tim Berners-Lee?</p> <img src="https://liip.rokka.io/www_inarticle/5a61db/5-star-steps.jpg" alt="5 star model describing the quality of the data"> <p>Source: <a href="http://5stardata.info/en/">http://5stardata.info/en/</a>, CC-Zero</p> <p>So with our RDF, we have already reached 4 stars! Now imagine a search engine that takes all those RDF graphs, is able to search in them and eventually is even able to connect them together. This is where schema.org/Dataset comes in. Based on the request from Dan I built a <a href="https://github.com/ckan/ckanext-dcat/pull/108">feature in ckanext-dcat</a> to map a DCAT dataset to a schema.org/Dataset. By default it returns the data as <a href="https://json-ld.org/">JSON-LD</a>. </p> <p>Even if you’ve never heard of JSON-LD, chances are that you’ve used it. Google is promoting it with the keyword <a href="https://developers.google.com/search/docs/guides/intro-structured-data">Structured Data</a>. At its core, JSON-LD is a JSON representation of an RDF graph. But Google is pushing this standard forward to enable all kinds of “semantic web” applications. The goal is to let a computer understand the content of a website or any other content that has JSON-LD embedded.<br /> And in the future, Google wants to have a better understanding of the concept of a <a href="https://developers.google.com/search/docs/data-types/dataset">“dataset”</a>, or to put it in the words of Dan Brickley:</p> <blockquote> <p>It's unusual for Google to talk much about search feature plans in advance, but in this case I can say with confidence &quot;we are still figuring out the details!&quot;, and that the shape of actual real-world data will be a critical part of that. That is why we put up the documentation as early as possible. If all goes according to plan, we will indeed make it substantially easier for people to find datasets via Google; whether that is via the main UI or a dedicated interface (or both) is yet to be determined. Dataset search has various special challenges which is why we need to be non-comital on the details at the stage, and why we hope publishers will engage with the effort even if it's in its early stages...</p> </blockquote> <p>This feature is deployed on the CKAN demo instance, so let’s look at an example. I can use the API to get a dataset as JSON-LD. So for the dataset <a href="https://demo.ckan.org/dataset/energy-in-malaga">Energy in Málaga</a>, I can build the URL like this:</p> <ul> <li>Append “.jsonld”</li> <li>Specify “schemaorg” as the profile (i.e.
the format of the mapping)</li> </ul> <p>Et voilà: <a href="https://demo.ckan.org/dataset/energy-in-malaga.jsonld?profiles=schemaorg">https://demo.ckan.org/dataset/energy-in-malaga.jsonld?profiles=schemaorg</a></p> <p>This is the result as JSON-LD:</p> <pre><code class="language-json"> { "@context": { "adms": "http://www.w3.org/ns/adms#", "dcat": "http://www.w3.org/ns/dcat#", "dct": "http://purl.org/dc/terms/", "foaf": "http://xmlns.com/foaf/0.1/", "gsp": "http://www.opengis.net/ont/geosparql#", "locn": "http://www.w3.org/ns/locn#", "owl": "http://www.w3.org/2002/07/owl#", "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#", "rdfs": "http://www.w3.org/2000/01/rdf-schema#", "schema": "http://schema.org/", "skos": "http://www.w3.org/2004/02/skos/core#", "time": "http://www.w3.org/2006/time", "vcard": "http://www.w3.org/2006/vcard/ns#", "xsd": "http://www.w3.org/2001/XMLSchema#" }, "@graph": [ { "@id": "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9", "@type": "dcat:Dataset", "dcat:contactPoint": { "@id": "_:N71006d3e0205458db0cc7ced676f91e0" }, "dcat:distribution": [ { "@id": "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9/resource/c3c5b857-24e7-4df7-ae1e-8fbe29db93f3" }, { "@id": "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9/resource/5ecbfa6c-9ea0-4f5f-9fbe-eb39964c0f7f" }, { "@id": "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9/resource/b74584c7-9a9a-4528-9c73-dc23b29c084d" } ], "dcat:keyword": [ "energy", "málaga" ], "dct:description": "Some energy related sources from the city of Málaga", "dct:identifier": "c8689e49-4fb2-43dd-85dd-ee243104a2a9", "dct:issued": { "@type": "xsd:dateTime", "@value": "2017-06-25T17:02:11.406471" }, "dct:modified": { "@type": "xsd:dateTime", "@value": "2017-06-25T17:05:24.777086" }, "dct:publisher": { "@id": "https://demo.ckan.org/organization/f0656b3a-9802-46cf-bb19-024573be43ec" }, "dct:title": "Energy in Málaga" }, { "@id": "https://demo.ckan.org/organization/f0656b3a-9802-46cf-bb19-024573be43ec", "@type": "foaf:Organization", "foaf:name": "BigMasterUMA1617" }, { "@id": "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9/resource/b74584c7-9a9a-4528-9c73-dc23b29c084d", "@type": "dcat:Distribution", "dcat:accessURL": { "@id": "http://datosabiertos.malaga.eu/recursos/energia/ecopuntos/ecoPuntos-23030.csv" }, "dct:description": "Ecopuntos de la ciudad de málaga", "dct:format": "CSV", "dct:title": "Ecopuntos" }, { "@id": "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9/resource/c3c5b857-24e7-4df7-ae1e-8fbe29db93f3", "@type": "dcat:Distribution", "dcat:accessURL": { "@id": "http://datosabiertos.malaga.eu/recursos/ambiente/telec/201706.csv" }, "dct:description": "Los datos se corresponden a la información que se ha decidido historizar de los sensores instalados en cuadros eléctricos de distintas zonas de Málaga.", "dct:format": "CSV", "dct:title": "Lecturas cuadros eléctricos Junio 2017" }, { "@id": "https://demo.ckan.org/dataset/c8689e49-4fb2-43dd-85dd-ee243104a2a9/resource/5ecbfa6c-9ea0-4f5f-9fbe-eb39964c0f7f", "@type": "dcat:Distribution", "dcat:accessURL": { "@id": "http://datosabiertos.malaga.eu/recursos/ambiente/telec/nodos.csv" }, "dct:description": "Destalle de los cuadros eléctricos con sensores instalados para su gestión remota.", "dct:format": "CSV", "dct:title": "Cuadros eléctricos" }, { "@id": "_:N71006d3e0205458db0cc7ced676f91e0", "@type": "vcard:Organization", "vcard:fn": "Gabriel Requena", "vcard:hasEmail": "gabi@email.com" } ] 
}</code></pre> <p>Google even provides a <a href="https://search.google.com/structured-data/testing-tool#url=https%3A%2F%2Fdemo.ckan.org%2Fdataset%2Fenergy-in-malaga.jsonld%3Fprofiles%3Dschemaorg">Structured Data Testing Tool</a> where you can submit a URL and it will tell you if the data is valid.</p> <p>Of course knowing the CKAN API is good if you’re a developer, but not really the way to go if you want a search engine to find your datasets. So the JSON-LD that you can see above is already embedded on the <a href="https://demo.ckan.org/dataset/energy-in-malaga">dataset page</a> (check out the <a href="https://search.google.com/structured-data/testing-tool#url=https%3A%2F%2Fdemo.ckan.org%2Fdataset%2Fenergy-in-malaga">testing tool with just the dataset URL</a>). So if you have enabled this feature, every time a search engine visits your portal, it will get structured information about the dataset it crawls instead of simply the HTML of the page. <a href="https://github.com/ckan/ckanext-dcat#structured-data">Check the documentation</a> for more information, but most importantly: if you’re running CKAN, give it a try! </p> <p>By the way: if you already have a custom profile for ckanext-dcat (e.g. for a DCAT application profile), check my current <a href="https://github.com/opendata-swiss/ckanext-switzerland/pull/177">pull request for a mapping of DCAT-AP Switzerland to schema.org/Dataset</a>.</p><img src="http://feeds.feedburner.com/~r/liip_blog/~4/NU2p1k-v5EM" height="1" width="1" alt=""/> https://www.liip.ch/en/blog/make-open-data-discoverable-for-search-engines Let's question the place of humans http://feeds.liip.ch/~r/liip_blog/~3/C5i0vIWJcmI/let-s-question-the-place-of-humans https://www.liip.ch/en/blog/let-s-question-the-place-of-humans Tue, 24 Apr 2018 00:00:00 +0200 <p>Photo credit: <a href="http://www.stemutz.ch/">Stemutz Photo</a> during TEDx Fribourg, April 16th 2018.</p> <h2>The Rise of the Robots</h2> <p>Are robots shoving aside their human counterparts? What is our place, as human beings, in a future where robots are becoming more capable and skilled? Popular culture certainly has its set of stories. Some of them are funny, others are scary. This subject triggers as much anxiety as hope.<br /> For example, in WIRED, Marguerite McNeal notes that &quot;one multi-tasker bot, from Momentum Machines, can make (and flip) a gourmet hamburger in 10 seconds and could soon replace an entire McDonalds crew&quot; in <em><a href="https://www.wired.com/brandlab/2015/04/rise-machines-future-lots-robots-jobs-humans/">The Rise of the Machine: The Future has lots of Robots, few Jobs for Human</a></em>.<br /> As a tech enthusiast, I can only recommend her article, where she interviews Martin Ford. According to his optimistic vision, <em>&quot;humans will live more productive and entrepreneurial lives, subsisting on guaranteed incomes generated by our amazing machines. &quot;</em></p> <h2>Et demain, encore humains?</h2> <p>Each TEDx event is fascinating, especially because of its innovative spirit. With &quot;And tomorrow, still human?&quot; TEDx Fribourg investigated the subject of artificial intelligence and robots with various speakers.
Martin Ford was not one of the speakers (sigh), but the team still created a captivating programme (<a href="http://2018.tedxfribourg.com/programme/">read the full programme here</a>).<br /> Below are only a few impressions, to make sure you join us at the next TEDx Fribourg.</p> <h3>Marc Atallah</h3> <img src="https://liip.rokka.io/www_inarticle/82ab66/tedxfribourg-2018-26783277637-o.jpg" alt="tedxfribourg-2018-26783277637-o"> <p>Marc Atallah is director of Maison d'Ailleurs (a museum of science fiction, utopia and extraordinary travels) and a lecturer and research professor at the French Section of the University of Lausanne. His research focuses on conjectural literatures (utopia, dystopia, imaginary journeys, science fiction) and literary theories (genre theories, theories of fiction).</p> <h3>Hervé Bourlard</h3> <img src="https://liip.rokka.io/www_inarticle/598240/tedxfribourg-2018-27782505848-o.jpg" alt="tedxfribourg-2018-27782505848-o"> <p>Hervé Bourlard is currently Director of the Idiap Research Institute in Martigny, EPFL Ordinary Professor, Vice-President of the Board of the IdeArk startup incubator, Member of the Board of Directors of the International Computer Science Institute at Berkeley, and adviser to the ECs.</p> <h3>David Ott</h3> <img src="https://liip.rokka.io/www_inarticle/14bef0/tedxfribourg-2018-39843684650-o.jpg" alt="tedxfribourg-2018-39843684650-o"> <p>In 2016, David Ott co-founded the Global Humanitarian Lab, a platform for sharing innovation for a partnership of humanitarian organizations. David deploys fab labs in humanitarian contexts for the affected populations and develops, using a global network, a sustainable fab lab model to respond to humanitarian issues.</p> <h2>Ingwerer drinks for everyone</h2> <p>After our legendary mojitos, we are switching to Ingwerer. All local and organic, of course.<br /> Meet us at TEDx Bern to taste them :D</p> <img src="https://liip.rokka.io/www_inarticle/c537aa/tedxfribourg-2018-39843614330-o.jpg" alt="tedxfribourg-2018-39843614330-o"> <p>Congratulations to everyone: the speakers for sharing their inspiration and the whole TEDx Fribourg team for its fantastic work.
We very much enjoyed the event.<br /> <a href="https://www.flickr.com/photos/132728463@N02/sets/72157694217778091">Discover the full photo album.</a></p><img src="http://feeds.feedburner.com/~r/liip_blog/~4/C5i0vIWJcmI" height="1" width="1" alt=""/> https://www.liip.ch/en/blog/let-s-question-the-place-of-humans When blockchain serves human rights: an uplifting use-case http://feeds.liip.ch/~r/liip_blog/~3/KdX_YTrzv_Y/when-blockchain-serves-human-rights-an-uplifting-use-case https://www.liip.ch/en/blog/when-blockchain-serves-human-rights-an-uplifting-use-case Fri, 20 Apr 2018 00:00:00 +0200 <p>We have been interested in blockchain technology for a while now, as we have always closely watched the emergence of the latest frameworks that drive innovation.<br /> Of course a few second thoughts emerged here and there, considering the notorious ecological footprint of Bitcoin mining, and maybe also an instinctive distrust of high-end finance systems that would blindly leverage tech innovations for short-term personal profit.</p> <p>As a reminder, our key values are authenticity, accessibility and openness, so let’s put ourselves in openness mode and have a close look at a recent Swiss blockchain project that won first prize at the SRG/SSR <a href="https://www.hackdays.ch">Hackdays</a> in Zurich, namely “<strong>Blockchain Reporter</strong>”, originally conceived by <strong>Marko Nalis</strong>, who teamed up with a few developers at the event and elaborated the project with them during the 13 hours of the hackathon.</p> <p><strong>“306” – in reference to the number of journalists imprisoned in Jan / Feb 2018 worldwide</strong></p> <p>Introducing his presentation with an intriguing image displaying the number “306”, Marko Nalis explained that this is the number of journalists imprisoned during the first two months of the year alone. They are imprisoned in countries where free speech is not granted. Yes, there are still a lot of those… ;-(</p> <p>Starting from the recognition that journalists and reporters are <strong>exposed to persecution when personally identified (#1)</strong> and considering that they also <strong>suffer from a constant reduction of their financial means (#2)</strong> in a media landscape where advertising funds erode, the team came up with this cutting-edge blockchain concept: a <strong>decentralized peer-to-peer platform that enables journalists and readers to connect directly</strong> while injecting value into the ecosystem through tokenization &amp; network effects.</p> <p>There is more: through the magic touch of decentralized blockchain technology, other features are addressed too: </p> <ul> <li>anonymity of journalists, </li> <li>guarantee of safe payment, </li> <li>avoidance of censorship, </li> <li>authenticity of publications, </li> <li>prevention of fake news, </li> <li>protection of the platform against arbitrary shutdown by a totalitarian government (because decentralization means there is no single IP address or server to act against ;-).</li> </ul> <p><strong>Removal of the “middleman”</strong></p> <p>At this stage it’s already amazing to see how blockchain’s decentralization effect (i.e. removing the “<a href="https://www.ted.com/watch/ted-institute/ted-bcg/blockchain-and-the-middleman">middleman</a>” – here, the press agency) simultaneously solves numerous parameters of the equation.<br /> But let’s dive a little deeper into the functioning of the “<a href="https://www.youtube.com/watch?v=XatEoU36U-o">token economy</a>” (i.e.
how publications and rewards are made possible by the community).<br /> Nota bene: this might get a little technical, but we will try to guide you through the three elements that are technically bound together to allow all parts to interact.</p> <p><strong>Part 1: ever heard of “Smart Contracts”?</strong></p> <p><a href="https://blockgeeks.com/guides/smart-contracts/">Smart Contracts</a> <strong>are independent, self-executing pieces of code</strong> that are immutable, not controlled by anyone and can’t be manipulated – unless 51% of the users owned the blockchain, which is most unlikely on popular blockchains like Ethereum and Bitcoin.<br /> Smart Contracts will fulfil some kind of “contract” once a given condition is met (similar in principle to IFTTT “if this, then that” rules, but more complex).<br /> Mostly run on Ethereum, which supports more use cases than the Bitcoin blockchain (where you can basically only transfer value), these <a href="https://blockgeeks.com/guides/smart-contracts/">Smart Contracts</a> don’t belong to anybody and preserve the anonymity of journalists.</p> <p><strong>Part 2: Linking the reference of the Smart Contract to a Public Address.</strong></p> <p>Let’s say a journalist filmed some footage and wants to share it on the platform anonymously. To do so he will use a <strong>dedicated Android-based app</strong> which uploads the content to a decentralized file system called <a href="https://ipfs.io">IPFS</a> (the InterPlanetary File System).<br /> IPFS uses a Merkle tree architecture that splits the uploaded content into chunks of data which are distributed across the network. The architecture ensures immutability and no single point of failure. (The blockchain itself is not efficient for storing rich content, due to transaction speed and cost, so it is used solely to publish a reference to the content in a specific Smart Contract.)</p> <p>Inside the Smart Contract the Public Address of a journalist is linked to his content and cryptographically signed. Thanks to the public address, the platform user can always be sure that the content of a given journalist is authentic - without even needing to know his real identity.</p> <p><strong>PART 3: a “<a href="https://blockgeeks.com/guides/dapps/">DAPP</a>” / <a href="https://blockgeeks.com/guides/dapps/">decentralized APP</a> – linking the platform to Ethereum</strong></p> <p>Lastly, the system needs a way to let you “talk” to the Ethereum wallet of the user/member.<br /> A browser runs the HTML/JavaScript front end and adds <a href="https://www.lifewire.com/what-is-web-3-0-3486623">Web 3 objects</a> through a plug-in called MetaMask. Web 3 enables you to make calls to the Ethereum network and call Smart Contract functions from JavaScript. These interactions with the blockchain cost a small fee called “<a href="https://ethereum.stackexchange.com/questions/3/what-is-meant-by-the-term-gas">gas</a>”, which is charged to get the transactions through the network (like any open-source system, Ethereum needs to be sustained).<br /> The system allows users to donate ether to journalists through the Smart Contract. These donations are also used to rate the popularity of journalists: <strong>the more donations</strong> for a given article, the <strong>better the rating</strong>.</p>
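<p>To make the donation flow a bit more concrete, here is a minimal sketch in Python using the web3.py library (v4-style API). The node URL and addresses are placeholders, and the real project routes donations through its own Smart Contract and DAPP front end; the sketch only illustrates the general pattern of sending ether and paying gas:</p> <pre><code class="language-python"># Hypothetical sketch with web3.py (pip install web3): a reader donating ether
# to a journalist's public address; the node URL and addresses are placeholders.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # local Ethereum node

reader = w3.eth.accounts[0]                                # the donating reader
journalist = "0x0000000000000000000000000000000000000000"  # journalist's public address

# the transaction fee ("gas") is what keeps the Ethereum network sustained
tx_hash = w3.eth.sendTransaction({
    "from": reader,
    "to": journalist,
    "value": w3.toWei(0.01, "ether"),  # donate 0.01 ETH
})
print("Donation sent, transaction hash:", tx_hash.hex())</code></pre>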
<p>Finally, a dedicated Ethereum framework called “<a href="http://truffleframework.com">Truffle</a>” offers many tools to compile and deploy Smart Contracts and to set up a test network. Smart Contracts themselves are written in a language called “Solidity”, which, in theory, can be picked up by any developer who masters JavaScript (although this quickly gets “a little complicated”, as Marko Nalis says with a touch of understatement ;-)</p> <img src="https://liip.rokka.io/www_inarticle/9c33bf/blockchainreporter-overview.jpg" alt="blockchainreporter-overview"> <p><strong>In a nutshell – what’s in it for me?</strong></p> <p>So beyond the technical aspects of this analysis, which will mostly interest our full stack developers, what innovation does this “Blockchain Reporter” project bring?<br /> Blockchain Reporter is a <strong>real-life use case designing concrete and applicable value-transferring models</strong> in an interesting vertical (the press industry).</p> <p>It was thrilling and uplifting to see how every parameter of the complex equation organically found its own place in the ecosystem (the donation/ranking logic, the token economy, the protection of the journalist). This is definitely a lesson for Liipers, as it can help us design or deploy further models in other verticals.<br /> It demonstrated that blockchain technology is not only meant for worldwide corporate companies, but also helps defend human values such as <strong>free speech</strong> and liberty.</p> <p>We’re wishing a long life and great success to the “Blockchain Reporter” project! </p> <img src="https://liip.rokka.io/www_inarticle/b4e490/reward-hackathon.png" alt="reward-hackathon" format="png"> <p>Here is the full LiipTalk from Marko Nalis: <a href="https://www.youtube.com/watch?v=P2CyfuC5EKA">youtube.com/watch?v=P2CyfuC5EKA</a></p> <p>If you are interested in these topics of attribution and valuation of content in the media, here is an interesting interview with a former Huffington Post journalist who joined the startup <a href="https://www.cjr.org/innovations/blockchain-poet.php">Po.et</a>, which aims to create the first globally verifiable record of digital media assets: <a href="https://bit.ly/2FEw2yh">https://bit.ly/2FEw2yh</a></p><img src="http://feeds.feedburner.com/~r/liip_blog/~4/KdX_YTrzv_Y" height="1" width="1" alt=""/> https://www.liip.ch/en/blog/when-blockchain-serves-human-rights-an-uplifting-use-case The Data Science Stack 2018 http://feeds.liip.ch/~r/liip_blog/~3/zTOIxDy_vyI/the-data-science-stack-2018 https://www.liip.ch/en/blog/the-data-science-stack-2018 Mon, 16 Apr 2018 00:00:00 +0200 <p>More than one year ago I sat down and went through my various github stars and browser bookmarks to compile what I then called the Data Science stack. It was basically an exhaustive collection of tools, some of which I use on a daily basis, while others I have only heard of. The outcome was a big PDF poster which you can download <a href="https://www.liip.ch/en/blog/data-stack">here</a>. </p> <p>The good thing about it was that every tool I had in mind could be found there somewhere, and like a map I could instantly see to which category it belonged. As a bonus I was able to identify my personal white spots on the map. The bad thing about it was that as soon as I had compiled the list, it was out of date. So I transferred the collection into a google sheet and whenever a new tool emerged on my horizon I added it there. Since then - in almost a year - I have added over 102 tools to it.
</p> <h2>From PDF to Data Science Stack website</h2> <p>While it would be OK to release another PDF of the stack year after year, I thought it might be a better idea to turn this into a website, where everybody can add tools to it.<br /> So without further ado I present you the <a href="http://datasciencestack.liip.ch">http://datasciencestack.liip.ch</a> page. Its goal is still to provide orientation like the PDF did, but ideally it should never become stale. </p> <img src="https://liip.rokka.io/www_inarticle/2dd1be/front.png" alt="frontpage" format="png"> <p><strong>Adding Tools: </strong>Adding tools to my google sheet felt a bit lonesome, so I asked others internally to add tools whenever they found new ones too. Finally, when moving away from the old google sheet and opening our collection process to everybody, I added a little button on the website that allows everybody to add tools to the appropriate category themselves. Just send us the name, link and a quick description and we will add it there after a quick sanity check. The goal is to gather user generated input too! I am also thinking about turning the website into a “github awesome” repository, so that tools can be added in a more programmer-friendly way. </p> <img src="https://liip.rokka.io/www_inarticle/855d2c/add.png" alt="adding tools for everyone" format="png"> <p><strong>Search:</strong> When entering new tools, I realized that I was often not sure whether a tool already existed on the page, and since tools are hidden away after the first five, the CTRL+F approach didn’t really work. That's why the website now has a little search box to check whether a tool is already in our list. If not, just add it to the appropriate category. </p> <p><strong>Mailing List:</strong> If you are a busy person and want to stay on top of things, I would not expect you to regularly check back and search for changed entries. This is why I decided to send out a quarterly mailing that contains the new tools we have added since the last data science stack update. This helps you to quickly reconnect to this important topic and maybe also to discover a data science gem you have not heard of yet. </p> <p><strong>JSON download:</strong> Some people asked me for the raw data of the PDF and at that time I was not able to give it to them quickly enough. That's why I added a JSON route that allows you to simply download the whole collection as a JSON file and create your own visualizations, maps or stacks with the tools that we have collected. Maybe something cool is going to come out of this. </p> <p><strong>Communication:</strong> Scanning through such a big list of options can sometimes feel a bit overwhelming, especially since we don’t really provide any additional info or orientation on the site. That’s why I added multiple ways of contacting us, in case you are right now searching for a solution for your business. I took the liberty of also linking our blog posts that are tagged with machine learning at the bottom of the page, because we often make use of these tools in them. </p> <p><strong>Zebra integration:</strong> Although it's nowhere visible on the website, I have hooked up the data science stack to our internal “technology database” system, called Zebra (actually Zebra does a lot more, but for us the technology part is relevant). Whenever someone enters a new technology into our technology db, it is automatically added for review to the data science stack.
Like this we are basically tapping into the collective knowledge of all the employees of our company. The screenshot below gives a glimpse of our tech db on Zebra, capturing not only the tool itself but also the common feelings towards it. </p> <img src="https://liip.rokka.io/www_inarticle/680599/zebra.png" alt="Zebra integration" format="png"> <h2>Insights from collecting tools for one more year</h2> <p>Furthermore, I would like to share the questions that guided me in researching each area and the insights that I gathered in the year of maintaining this list. Below you see a little chart showing the categories to which I added the most tools in the last year. </p> <img src="https://liip.rokka.io/www_inarticle/caabb8/graphs2.png" alt="overview" format="png"> <h3>Data Sources</h3> <p>One of the remaining questions for us is which tools offer good and legally compliant ways to capture user interaction. With Google Analytics being the norm, we are always on the lookout for new and fresh solutions in this area. Besides heatmap analytics, another new category I added is “Tag Management”. Regarding the classic website analytics solutions, I was quite surprised that there are still quite a lot of new solutions popping up. I added a whole lot of solutions, and entirely new categories like mobile analytics and app store analytics, after discovering the great github awesome list of analytics solutions <a href="https://github.com/onurakpolat/awesome-analytics">here</a>.</p> <img src="https://liip.rokka.io/www_inarticle/2aff10/sources2.png" alt="data sources" format="png"> <h3>Data Processing</h3> <p>How can we initially clean or transform the data? How and where can we store the logs that are created by these transformation events? And where can we get additional valuable data? Here I’ve added quite a few tools in the ETL area and in the message queue category. It looks like I will eventually need to split up the “message queue” category into multiple ones, because it feels like that one drawer in the kitchen where everything ends up in a big mess. </p> <img src="https://liip.rokka.io/www_inarticle/732417/processing.png" alt="data processing" format="png"> <h3>Database</h3> <p>What options are out there to store the data? How can we search through it? How can we access data sources efficiently? Here I mainly added a few specialized solutions, such as databases focused on storing time series or graph/network data. I might have missed something, but I feel that there is no new paradigm shift on the horizon right now (like graph-oriented, NoSQL, column-oriented or NewSQL databases once were). It is probably in the area of big data where most of the new tools emerged. An awesome list that goes beyond our collection can be found <a href="https://github.com/onurakpolat/awesome-bigdata">here</a>.</p> <img src="https://liip.rokka.io/www_inarticle/64a802/database.png" alt="database" format="png"> <h3>Analysis</h3> <p>Which stats packages are available to analyze the data? What frameworks are out there to do machine learning, deep learning, computer vision and natural language processing? Obviously, the high momentum of deep learning leads to many new entries in this category. In the “general” category I’ve added quite a few entries, showing that there is still huge momentum in the various areas of machine learning beyond only deep learning.
Interestingly, I did not find any new stats software packages, probably hinting that the paradigm of these one-size-fits-all solutions is over. The party is probably taking place in the cloud, where the big five have constantly added more and more specialized machine learning solutions, for example for text, speech, image, video or chatbot/assistant related tasks, just to name a few. At least those were the areas where I added most of the new tools. Going beyond the focus on Python, there is an awesome <a href="https://github.com/josephmisiti/awesome-machine-learning">list</a> that covers solutions for almost every programming language. </p> <img src="https://liip.rokka.io/www_inarticle/ab1df5/analysis.png" alt="analysis" format="png"> <h3>Visualization, Dashboards, and Applications</h3> <p>What happens with the results? What options do we have to visually communicate them? How do we turn those visualizations into dashboards or entire applications? Which additional ways to communicate with users besides reports/emails are out there? Surprisingly, I only added a few new entries here, maybe because I happened to be quite thorough at researching this area last year, or simply because the time of JS visualizations popping up left and right has cooled off a bit and the existing solutions are maturing. Yet this awesome <a href="https://github.com/fasouto/awesome-dataviz">list</a> shows that development in this area is still far from cooling off. </p> <img src="https://liip.rokka.io/www_inarticle/cd8663/viz.png" alt="visualization" format="png"> <h3>Business Intelligence</h3> <p>What solutions exist that try to integrate data sourcing, data storage, analysis and visualization in one package? What BI solutions are out there for big data? Are there platforms/solutions that offer more of a flexible data-scientist approach (e.g. free choice of methods, models, transformations)? Here I mostly added cloud platforms; it seems only logical to offer fewer and fewer desktop-oriented BI solutions, due to the limited computational power of the desktop and the high complexity of maintaining BI systems on premise. Although business intelligence solutions are less community and open source driven than the other stacks, there are also <a href="https://github.com/thenaturalist/awesome-business-intelligence">awesome lists</a> where people curate those solutions. </p> <img src="https://liip.rokka.io/www_inarticle/7fa6f4/bi.png" alt="business intelligence" format="png"> <p>You might have noticed that I tried to slip a github awesome list into almost every category to encourage you to look more deeply into each area. If you want to spend days of your life discovering awesome things, I strongly suggest you check out the collections of awesome lists <a href="https://github.com/jnv/lists">here</a> or <a href="https://github.com/sindresorhus/awesome">here</a>.</p> <h3>Conclusion or what's next?</h3> <p>I realized that keeping the list up to date in some areas seems almost impossible, while others gradually mature over time and the number of new tools in those areas is easy to keep track of. I also had to recognize that maintaining an exhaustive and always up-to-date list across those 5 broad categories is quite a challenge. That's why I went out to get help. I looked for people in our company with a particular interest in one of these areas and nominated them technology ambassadors for this part of the stack.
Their task will be to add new tools whenever these pop up on their horizon. </p> <p>I have also come to the conclusion that the stack is quite useful for offering customers a bit of an overview at the beginning of a journey. It adds value to just know what popular solutions are out there and to start digging around yourself. Yet separating the more mature tools from the experimental ones, or knowing which open source solutions have a good community behind them, is quite a hard task for somebody without experience. Somehow it would be great to highlight “the Pareto principle” in this stack by pointing to only a handful of solutions and saying you will be fine when you use those. Yet I also have to acknowledge that this will not replace a good consultation in the long run. </p> <p>Already looking towards the improvement of this collection, I think that each tool needs some sort of scoring: while there are plain vanilla tools that are mature and do the job, there are also highly specialized, very experimental tools that offer help in a very niche area only. While this information is somewhat buried in my head, it would be good to make it explicit on the website. Here I highly recommend what Thoughtworks has come up with in their <a href="https://www.thoughtworks.com/radar">technology radar</a>. Although their radar goes well beyond our little domain of data services, it offers a great way to differentiate tools, namely into four categories: </p> <ul> <li>Adopt: We feel strongly that the industry should be adopting these items. We see them when appropriate on our projects. </li> <li>Trial: Worth pursuing. It is important to understand how to build up this capability. Enterprises should try this technology on a project that can handle the risk. </li> <li>Assess: Worth exploring with the goal of understanding how it will affect your enterprise. </li> <li>Hold: Proceed with caution.</li> </ul> <img src="https://liip.rokka.io/www_inarticle/37daaf/radar.png" alt="Technology radar" format="png"> <p>Assessing tools according to these criteria is no easy task - Thoughtworks does it by nominating a high-profile jury that votes regularly on these tools. With 4500 employees, I am sure that their assessment is a representative sample of the industry. For us and our stack, a first step would be to adopt this differentiation, fill it out myself and then get other Liipers to vote on these categories. To a certain degree we have already started this task internally in our tech db, where each employee records a common feeling towards a tool. </p> <p>Concluding this blogpost, I realized that the simple task of “just” having a list with relevant tools for each area seemed quite easy at the start. The more I think about it, and the more experience I collect in maintaining this list, the more I realize that such a list is eventually growing into a knowledge and technology management system. While such systems have their benefits (e.g. in onboarding or quickly finding experts in an area), I feel that turning this list into one would mean walking down a rabbit hole from which I might never re-emerge.
Let’s see what the next year will bring.</p><img src="http://feeds.feedburner.com/~r/liip_blog/~4/zTOIxDy_vyI" height="1" width="1" alt=""/> https://www.liip.ch/en/blog/the-data-science-stack-2018 Best of Swiss Web 2018 http://feeds.liip.ch/~r/liip_blog/~3/qFTawQGojB4/bosw2018 https://www.liip.ch/en/blog/bosw2018 Mon, 16 Apr 2018 00:00:00 +0200 <h2>Liip once again wins Best of Swiss Web awards</h2> <p>Liip gained four awards at the Best of Swiss Web 2018 awards ceremony, securing its position as one of “Best of Swiss Web’s” top five award winners. Liip won gold in the Innovation category, silver in the Marketing category and bronze in the Technology and Public Affairs categories. Five different projects were submitted and they all reached the shortlist.</p> <h3>Western Swiss Tooyoo project wins gold</h3> <p>For the first time in Liip’s Best of Swiss Web award history, a western Swiss project was nominated for the Master award: Tooyoo – supporting your loved ones after your death. Tooyoo is a digital platform that securely stores the final requests of relatives who have died, as the death of a loved one can often bring huge administrative challenges with it. Step by step, Tooyoo helps you to handle the administrative and organisational procedures which follow a death – reliably, easily and intuitively. The platform breaks what is a taboo subject in Switzerland and helps you to organise your ‘digital afterlife’ right now. The trend towards keeping online cemeteries as small as possible has been around for the past few years. Liip worked with Superhuit to implement this Mobiliar project. ‘It is the desire to succeed together that has allowed us to develop tooyoo.ch. We are very happy to have been able to count on Liip’s technical expertise and their product definition and development skills’, says Julien Ferrari, Operations Happyender in Chief for the Tooyoo start-up. Tooyoo won gold in the Innovation category.<br /> <a href="https://www.tooyoo.ch/">www.tooyoo.ch</a></p> <h3>rokka rocks – image processing made easy</h3> <p>‘Faster websites with smaller images reduce loading times and hosting costs’ is rokka’s motto. This product, owned by Liip, compresses images without any loss of quality: with rokka, image files are up to 80% smaller. Whether for handling image formats, SEO attributes, special image details or lightning-fast delivery, rokka is the ideal tool for your digital images. More than five million images have already been uploaded and supplied, and data volumes of more than 120 GB are being managed. ‘rokka gives us a central image processing service for our images, which guarantees that images are stored and distributed in a fast, secure and efficient way. This means that we are able to display the same image in a wide variety of formats on any of our platforms, without significant effort required in terms of image management’ – Alain Petignat, Head of Online Development and Operations for Migros-Genossenschafts-Bund, on rokka. rokka won bronze in the Technology category.<br /> <a href="https://www.rokka.io/">www.rokka.io</a></p> <h3>ETH real estate platform supports architecture students</h3> <p>Liip developed an application for the Chair of Architecture and Building Process in ETH’s Department of Architecture. It provides an easy way of helping students to understand the economic mechanisms of real estate: how, where and under what conditions is it worthwhile constructing a building?
These questions were all incorporated into the app, which also helps students to complete complicated calculations (building costs, capacity etc.). ‘Many students are afraid of our subject area because of the quantity of mathematics involved. On paper, completing lots of complicated calculation steps in a short period of time and combining data are very inflexible processes. Students can run into trouble in examinations if they make an error that corrupts all subsequent calculations and do not spot it until a late stage. The app helps with this, as modules can be revised at any time’ – Hannes Reichel, a lecturer in the Department of Architecture at ETH, on the real estate platform. The ETH real estate economics project was awarded a bronze in the Public Affairs category.</p> <h3>Migros working environment wins silver</h3> <p>The website migros-gruppe.jobs was developed to combine vacancies at the Migros Group’s over 60 companies in a single location. This job portal is aiding the Migros Group in an increasingly competitive employment market. The platform was created to serve as a unified port of call for job seekers. The aim of the project was to develop and create a new, exciting, comprehensive careers platform for the Migros Group that was relevant to the Swiss employment market. The job portal presents relevant, optimised information to the user in an easily comprehensible way. ‘Digitisation means that quicker reaction times are needed on the market. Launching a collective careers platform is the first step’, says Micol Rezzonico, head of the Competence Center, Employer Branding. The job portal supports the Migros Group’s employer branding, and thus its marketing strategy, and won silver in the Marketing category.<br /> <a href="https://migros-gruppe.jobs">migros-gruppe.jobs</a></p> <p>The entire Liip team is delighted with this success, and we are of course also very pleased about the successes of the other agencies: we would particularly like to congratulate Unic and SBB, named Master of Swiss Web 2018.</p><img src="http://feeds.feedburner.com/~r/liip_blog/~4/qFTawQGojB4" height="1" width="1" alt=""/> https://www.liip.ch/en/blog/bosw2018 The Facebook Scandal - or how to predict psychological traits from Facebook likes. http://feeds.liip.ch/~r/liip_blog/~3/GxOs8WBMm9M/the-facebook-scandal-or-how-to-predict-psychological-traits-from-facebook-likes https://www.liip.ch/en/blog/the-facebook-scandal-or-how-to-predict-psychological-traits-from-facebook-likes Fri, 06 Apr 2018 00:00:00 +0200 <p>The facebook scandal or the tempest in a teapot seems to be everywhere these days. Due to it Facebook has lost billions in stock market value, governments on both sides of the Atlantic have opened investigations, and a social movement is calling on users to #DeleteFacebook. 
While the <a href="http://www.zeit.de/wirtschaft/2018-03/plattformkapitalismus-internetplattformen-regulierung-facebook-cambridge-analytica">press</a> around the <a href="https://www.usatoday.com/story/opinion/2018/03/22/facebook-mark-zuckerberg-not-our-friends-cambridge-analytica-column/448767002/">world</a> and also in <a href="https://www.nzz.ch/meinung/facebook-unschuldig-am-pranger-ld.1369562">Switzerland</a> is discussing back and forth how big Facebook’s role in the current data misuse really was, I thought it would be great to answer three central questions, which even I had after reading a couple of articles: </p> <ol> <li>It seems that CA obtained the data somewhat semi-legally, but everyone seems very upset about that. How did they do it and can you do it, too? :)</li> <li>People feel manipulated with big data and big buzzwords, yet practically speaking, how much can one deduce from our Facebook likes? </li> <li>How does the whole thing work, predicting my innermost psychological traits from Facebook likes? It seems almost like wizardry. </li> </ol> <p>To provide answers to these questions I have structured the article mainly around a theoretical part and a practical part. In the theoretical part I'd like to give you an introduction to psychology on Facebook, explaining the research behind it, showing how such data was collected initially and finally highlighting its implications for political campaigning. In the practical part I will walk you through a step-by-step example that shows how machine learning can deduce psychological traits from Facebook data, showing you the methods, little tricks and actual results. </p> <h3>Big 5 Personality Traits</h3> <p>Back in 2016 an article was frantically shared in the DACH region. With its quite sensation-seeking title <a href="https://www.dasmagazin.ch/2016/12/03/ich-habe-nur-gezeigt-dass-es-die-bombe-gibt/?reduced=true">Ich habe nur gezeigt, dass es die Bombe gibt</a>, the article claimed that basically our most “inner” personality traits - often called the Big 5 or <a href="https://en.wikipedia.org/wiki/Big_Five_personality_traits">OCEAN</a> - can be extracted from our Facebook usage and are used to manipulate us in political campaigns. </p> <img src="https://liip.rokka.io/www_inarticle/3c894b/taits.png" alt="personality traits" format="png"> <p>This article itself was based on the <a href="http://www.pnas.org/content/early/2015/01/07/1418680112">research paper</a> of Michal Kosinski and other Cambridge researchers who studied the correlation between Facebook likes and our personality traits, while their <a href="http://www.pnas.org/content/110/15/5802">older research</a> about this topic goes back even to 2013. Although some of these researchers ended up in the eye of the storm of this scandal, they are definitely not the <a href="https://www.sciencedirect.com/science/article/pii/S0191886917307328">only ones</a> studying this interesting field. </p> <p>In the OCEAN definition our personality traits are: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism. While psychologists normally have to ask people around 100 questions to determine these traits - you can do a test yourself <a href="https://www.psychometrictest.org.uk/big-five-personality/">online</a> to get an idea - the article discussed a way in which Facebook data could be used to infer them instead.
They showed that &quot;(i) computer predictions based on a generic digital footprint (Facebook Likes) are more accurate (r = 0.56) than those made by the participants’ Facebook friends using a personality questionnaire (r = 0.49)&quot;. This basically meant that, by using a machine learning approach based on your Facebook data, they were able to train a model that was better at describing you than a friend of yours. They also found out that the more Facebook likes they have for a single person, the better the model predicts those traits. So, as you might have expected, having more data on users pays off in these terms. The picture below shows a birds-eye view of their approach.</p> <img src="https://liip.rokka.io/www_inarticle/cca203/approach.png" alt="the original approach from the paper" format="png"> <p>They also came to the conclusion that &quot;computers outpacing humans in personality judgment, presents significant opportunities and challenges in the areas of psychological assessment, marketing, and privacy.&quot; In other words, they hinted that their approach might be quite usable beyond the academic domain. </p> <h3>Cambridge Analytica</h3> <p>This is how Cambridge Analytica (CA) comes into play, which originally got its name from its cooperation with the psychometrics department in Cambridge. Founded in 2014, their goal was to make use of such psychometric results in the political campaign domain. Making use of psychometric insights is nothing new per se; since at least the 1980s, different actors have been making use of <a href="https://de.wikipedia.org/wiki/Kategorie:Sozialwissenschaftliche_Klassifikation">various psychometric classifications</a> for various applications. For example the <a href="https://de.wikipedia.org/wiki/Sinus-Milieus">Sinus Milieus</a> are quite popular in the DACH region, mainly for marketing purposes. </p> <p>The big dramatic shift in comparison to these “old” survey-based approaches is that CA was doing two things differently: Firstly they focused specifically on Facebook as a data source and secondly they used Facebook as a platform primarily for their interventions. Instead of running long and expensive questionnaires, they were able to collect such data in an easy manner, and they could also reach their audience in a highly individualized way. </p> <h3>How did Cambridge Analytica get the data?</h3> <p>Cambridge researchers had originally created a Facebook app that was used to collect data for the research paper mentioned above. Users filled out a personality test on Facebook and then agreed that the app could collect their profile data. There is nothing wrong with that, except for the fact that at this time Facebook (and the users) also allowed the app to collect the personality profiles of their friends, who never participated in the first place. At this point I am not sure which of the many such apps (“mypersonality”, “thisisyourdigitallife” or others) was used to gather the data, but the result was that with this snowball system approach CA quickly collected data on roughly 50 million users. This data became the basis for the work of CA. And CA was heading towards using this data to change the way political campaigns work forever.
</p> <h3>Campaigns: From Mass communication to individualized communication</h3> <p>I would argue that the majority of people are still used to the classic political campaigns that are broadcast to us via billboard ads or TV ads and have one message for everybody, thus following the old paradigm of <a href="https://en.wikipedia.org/wiki/Mass_communication">mass communication</a>. While such messages have a target group (e.g. women over 40), it is still hard to reach exactly those people, since so many other people are reached too, making this approach rather costly and ineffective.</p> <p>With the internet era things changed quickly here: In today’s online advertising world a very detailed target group can be reached easily (via Google Ads or Facebook advertising) and each group can potentially receive a custom marketing message. Such <a href="https://de.wikipedia.org/wiki/Programmatic_Advertising">programmatic advertising</a> surely disrupted the advertising world, but CA’s idea was not to use this mechanism for advertising but for political campaigns. While quite a lot of <a href="https://www.washingtonpost.com/politics/cruz-campaign-paid-750000-to-psychographic-profiling-company/2015/10/19/6c83e508-743f-11e5-9cbb-790369643cf9_story.html">details</a> are known about those campaigns and some <a href="https://en.wikipedia.org/wiki/Cambridge_Analytica">shady</a> practices have been made visible, here I want to focus on the idea behind them. The really smart idea about these campaigns is that a political candidate can appeal to many different, divergent groups of voters at the same time! This means that they could appear as safety-loving to risk-averse persons and as risk-loving to young entrepreneurs at the same time. So the same mechanisms that are used to convince people of a product are now being used to convince people to vote for a politician - and every one of these voters might do so for individual reasons. Great, isn't it? </p> <img src="https://liip.rokka.io/www_inarticle/64dd5a/dr-dave-inked.jpg" alt="doctor or tattoo artist?"> <p>This brings our theoretical part to an end. We now know the hows and whys behind Cambridge Analytica and why this new form of political campaigning matters. It's time now to find out how they were able to infer personality traits from Facebook likes. I am sorry to disappoint you, though: I won’t cover how to run a political campaign on Facebook in the practical part that follows.</p> <h3>Step 1: Get, read in and view data</h3> <p>What most people probably don’t know is that the <a href="http://mypersonality.org">website</a> of one of the initial projects still offers a big sample of the collected (and of course anonymized) data for research purposes. I downloaded it and put it to use for our small example.
After reading in the data from the CSV files we see that we have 3 tables.</p> <pre><code class="language-python"># import the library needed for this step and read in the three tables
import pandas as pd

users = pd.read_csv("users.csv")
likes = pd.read_csv("likes.csv")
ul = pd.read_csv("users-likes.csv")</code></pre> <img src="https://liip.rokka.io/www_inarticle/19d73b/users.png" alt="users table" format="png"> <p>The users table with 110 thousand users contains information about demographic attributes (such as age and gender) as well as their scores on the 5 different personality traits: openness to experience (ope), conscientiousness (con), extraversion (ext), agreeableness (agr) and neuroticism (neu).</p> <img src="https://liip.rokka.io/www_inarticle/6e471c/likes.png" alt="pages table" format="png"> <p>The likes table contains a list of all 1.5 million pages that have been liked by users. </p> <img src="https://liip.rokka.io/www_inarticle/965ac0/edges.png" alt="edges" format="png"> <p>The edges table contains all the 10 million edges between the users and the pages. In fact, seen from a network research perspective, our “small sample” network is already quite big by normal <a href="https://en.wikipedia.org/wiki/Social_network_analysis">social network analysis</a> standards. </p> <h3>Step 2: Merge Users with Likes in one dataframe aka the adjacency matrix</h3> <p>In step 2 we now want to create a network between users and pages. In order to do this we have to convert the edge list to a so-called <a href="https://en.wikipedia.org/wiki/Adjacency_matrix">adjacency matrix</a> (see the image with an example below). In our case we want to use a sparse adjacency format, since we have roughly 10 million edges. The conversion, as shown below, transforms our edge list into categorical integer codes which are then used to create our matrix. By doing it in an ordered way, the rows of the resulting matrix still match the rows from our users table. This saves us a lot of cumbersome lookups. </p> <img src="https://liip.rokka.io/www_inarticle/a4b915/network4.png" alt="from edge list to adjacency matrix" format="png"> <pre><code class="language-python"># transforming the edge list into a sparse adjacency matrix
import numpy as np
from scipy.sparse import csc_matrix

# important to keep the same order as the users and likes tables
rows = ul["userid"].astype('category', ordered=True, categories=users["userid"]).cat.codes
cols = ul["likeid"].astype('category', ordered=True, categories=likes["likeid"]).cat.codes
ones = np.ones(len(rows), np.uint32).tolist()
sparse_matrix = csc_matrix((ones, (rows, cols)))</code></pre> <p>If you are impatient like me, you might have tried to use this matrix directly to predict the users’ traits. I have tried it and the results are rather unsatisfactory, due to the fact that we have way too many very “noisy” features, which give us models that predict almost nothing of our personality traits. The next two steps can be considered feature engineering, or simply good tricks that work well with network data. </p> <h3>Trick 1: Prune the network</h3> <p>To obtain more meaningful results, we want to prune this network and only retain the most active users and pages. Theoretically - while working with a library like <a href="https://networkx.github.io">networkx</a> - we could simply say: </p> <p>«Let’s throw out all users with a <a href="https://en.wikipedia.org/wiki/Network_science">degree</a> of less than 5 and all pages with a degree of less than 20.» This would give us a network of users that are highly active and of pages that seem to be highly relevant.
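Purely as an illustration of that idea, here is what such degree-based pruning could look like with networkx on a tiny, made-up toy graph; the node names and the threshold are invented for this sketch, and it is not the approach used for our real network below, which is far too large for it: </p> <pre><code class="language-python"># illustrative toy example only: degree-based pruning on a tiny made-up user-page graph
import networkx as nx

G = nx.Graph()
# made-up edges: users u1-u3 like pages p1-p3
G.add_edges_from([("u1", "p1"), ("u1", "p2"), ("u1", "p3"),
                  ("u2", "p1"), ("u3", "p1")])

# toy threshold: keep only users and pages with a degree of at least 2
degrees = dict(G.degree())
too_small = [node for node, degree in degrees.items() if degree &lt; 2]
G.remove_nodes_from(too_small)

print(sorted(G.nodes()))  # ['p1', 'u1'] - only the active user and the popular page remain</code></pre> <p>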
In social network research, computing <a href="https://en.wikipedia.org/wiki/Degeneracy">k-cores</a> often gives you a similar pruning effect. </p> <p>But since our network is quite big I have used a poor man's version of that pruning approach: We will be going through a while loop that throws out rows (in our case users) where the sum of their edges is less than 50 and columns (pages) where the sum of their edges is less than 150. This is equivalent to throwing out users that follow fewer than 50 pages and throwing out pages that have fewer than 150 users. Since the removal of a page or user might violate one of our conditions again, we will continue to reduce the network as long as columns or rows still need to be removed. Once both conditions are met, the loop stops. While pruning we are also updating our users and likes tables (via boolean filtering), so we can track which users like which pages and vice versa. </p> <pre><code class="language-python"># pruning the network down to the most relevant users and pages
print(sparse_matrix.shape)
threshold = 50
while True:
    i = sum(sparse_matrix.shape)
    # keep pages (columns) that have more than 150 likes
    columns_bool = (np.sum(sparse_matrix, axis=0) &gt; 3 * threshold).getA1()
    sparse_matrix = sparse_matrix[:, columns_bool]
    likes = likes[columns_bool]
    # keep users (rows) that like more than 50 pages
    rows_bool = (np.sum(sparse_matrix, axis=1) &gt; threshold).getA1()
    sparse_matrix = sparse_matrix[rows_bool]
    users = users[rows_bool]
    print(sparse_matrix.shape)
    print(users.shape)
    # stop once no more rows or columns were removed in this pass
    if sum(sparse_matrix.shape) == i:
        break</code></pre> <p>This process quite significantly reduces our network to roughly 19 thousand users and 8-9 thousand pages, which serve as the attributes of each user. What I will not show here is my second failed attempt at predicting the psychological traits. While the results were slightly better due to the better signal-to-noise ratio, they were still quite unsatisfactory. That's where our second very classic trick comes into play: <a href="https://en.wikipedia.org/wiki/Dimensionality_reduction">dimensionality reduction</a>.</p> <h3>Trick 2: Dimensionality reduction or in our case SVD</h3> <p>The logic behind dimensionality reduction can be explained intuitively: Usually when we analyse users’ attributes, we often find attributes that describe the &quot;same things&quot; (latent variables or factors). Factors are artificial categories that emerge by combining “similar” attributes that behave in a very “similar” way (e.g. your age and your number of wrinkles). In our case these variables are the pages that a user liked. So there might be a page called &quot;Britney Spears&quot; and another one called &quot;Britney Spears Fans&quot;, and all users that like the first also like the second. Intuitively we would want both pages to &quot;behave&quot; like one, so kind of merge them into one page (the small example below shows exactly this effect). </p> <p>For such approaches a number of methods are available - although they all work a little bit differently - the most used examples being <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">Principal component analysis</a>, <a href="https://en.wikipedia.org/wiki/Singular-value_decomposition">Singular value decomposition</a> and <a href="https://en.wikipedia.org/wiki/Linear_discriminant_analysis">Linear discriminant analysis</a>. These methods allow us to “summarize” or “compress” the dataset into as many dimensions as we want. And as a benefit these dimensions are sorted in such a way that the most important ones come first. </p> <p>So instead of looking at a couple of thousand pages per user, we can now group them into 5 “buckets”.
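As a tiny, made-up illustration of this merging effect: if two pages are liked by exactly the same users, a truncated SVD assigns them identical loadings, i.e. they effectively behave as one page (the page names and numbers below are invented for the sketch): </p> <pre><code class="language-python"># illustrative toy example only: two identical "pages" collapse onto the same components
import numpy as np
from sklearn.decomposition import TruncatedSVD

# rows = users, columns = ["Britney Spears", "Britney Spears Fans", "Chess Club"]
toy_likes = np.array([[1, 1, 0],
                      [1, 1, 0],
                      [1, 1, 1],
                      [0, 0, 1]])

toy_svd = TruncatedSVD(n_components=2)
toy_svd.fit(toy_likes)

# the first two columns get identical loadings, i.e. the two Britney pages behave as one
print(np.round(toy_svd.components_, 2))</code></pre> <p>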
Each bucket will contain pages that are similar in regard to how users perceive these pages. Finally we can correlate these factors with the users’ personality traits. Scikit-learn offers us a great way to perform a PCA on a big dataset with the incremental PCA method, which even works with datasets that don’t fit into RAM. An even more popular approach (often used in recommender systems) is SVD, which is fast, easy to compute and yields good results. </p> <pre><code class="language-python"># performing dimensionality reduction
from sklearn.decomposition import TruncatedSVD
# from sklearn.decomposition import IncrementalPCA  # the PCA alternative mentioned above

svd = TruncatedSVD(n_components=5)  # SVD
# ipca = IncrementalPCA(n_components=5, batch_size=10)  # PCA
df = svd.fit_transform(sparse_matrix)
# df = ipca.fit_transform(sparse_matrix)</code></pre> <p>In the code above I have reduced the thousands of pages to just 5 features for visualization purposes. We can now do a pairwise comparison between the personality traits and the 5 factors that we computed and visualize it in a heatmap. </p> <pre><code class="language-python"># generating a heatmap of user traits vs factors
import seaborn as sns

# after pruning, the rows of the users table still match the rows of the factor matrix df
tmp = users.iloc[:, 1:9].values                             # drop userid, convert to np array
combined = pd.DataFrame(np.concatenate((tmp, df), axis=1))  # one big dataframe
combined.columns = ["gender", "age", "political", "ope", "con", "ext", "agr", "neu",
                    "fac1", "fac2", "fac3", "fac4", "fac5"]
heatmap = combined.corr().iloc[8:13, 0:8]                   # factors (rows) vs demographics and traits (columns)
sns.heatmap(heatmap, annot=True)</code></pre> <img src="https://liip.rokka.io/www_inarticle/bf7686/matrix.png" alt="heatmap" format="png"> <p>In the heatmap above we see that factor 3 seems to be quite highly positively correlated with the user’s openness. We also see that factor 1 is negatively correlated with age. So the older you get, the less you probably visit pages from this area. Generally, though, we see that the correlations between some factors and traits are not very high (e.g. agreeableness). </p> <h3>Step 3: Finally build a machine learning model to predict personality traits</h3> <p>Armed with our new features we can come back and try to build a model that finally will do what I promised: namely predict the user’s traits based solely on those factors. What I am not showing here is the experimentation involved in choosing the right model for the job. After trying out a few models like <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">Linear Regression</a>, <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html">Lasso</a> or <a href="http://scikit-learn.org/stable/modules/tree.html">decision trees</a>, the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLars.html">LassoLars model</a> with cross-validation worked quite well. In all the approaches I’ve split the data into a training set (90% of the data) and a test set (10% of the data), to be able to compute the accuracy of the models on unseen data. I also applied some poor man’s hyperparameter tuning, where the whole prediction pipeline is run with different values of k for the SVD dimensionality reduction.
</p> <pre><code class="language-python"># training and testing the model
from scipy.stats import pearsonr
from sklearn.linear_model import LassoLarsCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

out = []
out.append(["k", "trait", "mse", "r2", "corr"])
for k in [2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]:
    print("Hyperparameter SVD dim k: %s" % k)
    svd = TruncatedSVD(n_components=k)
    sparse_matrix_svd = svd.fit_transform(sparse_matrix)
    df_svd = pd.DataFrame(sparse_matrix_svd)
    df_svd.index = users["userid"]
    total = 0
    for target in ["ope", "con", "ext", "agr", "neu"]:
        y = users[target]
        y.index = users["userid"]
        tmp = pd.concat([y, df_svd], axis=1)
        data = tmp[tmp.columns.difference([target])]
        X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.1)
        clf = LassoLarsCV(cv=5, precompute=False)
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        corr = pearsonr(y_test, y_pred)[0]
        total += r2  # accumulate R2 over the five traits for this k
        print(' Target %s Corr score: %.2f. R2 %s. MSE %s' % (target, corr, r2, mse))
        out.append([k, target, mse, r2, corr])
    print(" k %s. Total R2 %s" % (k, total))</code></pre> <p>To see which number of dimensions gave us the best results, we can simply look at the printout or visualize it nicely with seaborn below. In the case above I found that solutions with 90 dimensions gave me quite good results. A more production-ready way of doing this is with <a href="http://scikit-learn.org/stable/auto_examples/plot_compare_reduction.html">GridSearch</a>, but I wanted to keep the amount of code for this example minimal. </p> <pre><code class="language-python"># visualizing the hyperparameter search
gf = pd.DataFrame(columns=out[0], data=out[1:])  # skip only the header row
g = sns.factorplot(x="k", y="r2", data=gf, size=10, kind="bar", palette="muted")</code></pre> <img src="https://liip.rokka.io/www_inarticle/dda61b/hyper.png" alt="results" format="png"> <h3>Step 4: Results</h3> <p>So now we can finally look at how the model performed on each trait when using 90 dimensions in the SVD. We see that we are not great at explaining the user’s traits. Our <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination">R^2 score</a> shows that for most traits we can only explain roughly 10-20% of the variance using Facebook likes. Among all traits, openness seems to be the most predictable attribute. This seems to make sense, as open people would be willing to share more of their likes on Facebook. While some of you might feel a bit disappointed that we were not able to predict 100% of a user’s psychological traits, you should know that in a typical <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927808/">re-test</a> of psychological traits, researchers are also only able to explain roughly 60-70% of the traits again. So we did not do too badly after all. Last but not least, it's worth mentioning that if we are able to predict roughly 10-20% per trait times 5 traits, overall we still know quite a bit about the user in general. </p> <img src="https://liip.rokka.io/www_inarticle/8fdbf2/results-lasso.png" alt="results" format="png"> <h3>Conclusion or what's next?</h3> <p>From my point of view, I think this small example shows two things: </p> <p>Firstly, we learned that it is hard to predict our personality traits well from JUST Facebook likes. In this example we were only able to predict a maximum of 20% of a personality trait (in our case openness). While there are of course myriads of ways to improve our model (e.g.
by using more data, different types of data, smarter features or better methods), quite a bit of variance might still remain unexplained - which is not too surprising in the area of psychology or the social sciences. While I have not shown it here, the easiest way to improve our model would simply be to allow demographic attributes as features too. So knowing a user’s age and gender would allow us to improve the predictions by roughly another 10-20%. But this would then rather feel like one of those old-fashioned approaches. What we could do instead is use the user’s likes to predict their gender and age; but let’s save this for another blog post. </p> <p>Secondly, we should not forget that even knowing a person’s personality very well might not translate into highly effective political campaigns that are able to swing a user’s vote. After all, that is what CA promised their customers, and that’s what the media is fearing. Yet the general approach of running political campaigns in the same way as marketing campaigns is probably here to stay. Only the future will show how effective such campaigns really are. After all, Facebook is not an island: Although you might see Facebook ads showing your candidate in the best light for you, there are still the old-fashioned broadcast media (which still account for the majority of media consumption today), where watching an hour-long interview with your favorite candidate might shine a completely different light on him. I am looking forward to seeing whether old virtues such as authenticity or sincerity might not give better mileage than personalized Facebook ads.</p><img src="http://feeds.feedburner.com/~r/liip_blog/~4/GxOs8WBMm9M" height="1" width="1" alt=""/> https://www.liip.ch/en/blog/the-facebook-scandal-or-how-to-predict-psychological-traits-from-facebook-likes 3D Easter Bunny Making-of http://feeds.liip.ch/~r/liip_blog/~3/dY660Lfr5sU/3d-easter-bunny-making-of https://www.liip.ch/en/blog/3d-easter-bunny-making-of Fri, 06 Apr 2018 00:00:00 +0200 <h3>Conceptual challenges</h3> <p>For us as designers, the biggest challenge was to design the bunny out of the blue. Since we didn’t have a concept made by a dedicated 2D artist, there was a lot of trial and error to find its ultimate cuteness.</p> <h3>Design Process</h3> <p>The usual way of designing a 3D model is to create a 2D model first, usually drawn by a concept artist. That model can either be a sketch or a detailed drawing. Subsequently, the model goes through a feedback loop, is refined and approved. Only at that stage does the 3D artist model it, thus dividing the whole process into three major steps.</p> <p>Just like when we design web pages, our UX designers start by crafting mockup screens, followed by a frontend developer who assesses their feasibility and implements them afterwards. Professional 3D artists in big studios, however, are generally given a concept drawing that is detailed enough for them to simply “copy”, or rather translate, into 3D.<br /> Pedro Couto – our main 3D artist at Liip – however, was given free rein creatively, doing the entire form-finding process directly in 3D. The challenging part was to build everything from his imagination, without a 2D model, but it ultimately led to a great result, we believe.</p> <h3>Decision to go straight into 3D</h3> <p>Practice over theory is one of our core principles at Liip. As such, we worked with compasses over maps, or creativity over fully defined concept art, this time.
Starting to model in 3D directly led to several iterations of the bunny and allowed Pedro to further hone his modeling skills. Furthermore, it enabled him to explore new ways to efficiently draft objects in 3D. </p> <p>Pedro and I, as a UX designer and illustrator, worked closely together to further enhance Hazel’s cuteness factor and to align its visual appearance with Liip’s branding guidelines.</p> <h3>Evolution of the bunny</h3> <p>It takes a lot of effort to build things in 3D. Here you see some of the several stages Hazel went through. </p> <p>Step 1: Sculpting the first draft</p> <img src="https://liip.rokka.io/www_inarticle/844689/bunny-1.png" alt="bunny-1" format="png"> <p>Step 2: Joining all the blocks, plus further sculpting. Hazel was in need of some arms too…</p> <img src="https://liip.rokka.io/www_inarticle/a0db87/bunny2.png" alt="bunny2" format="png"> <p>Last but not least: Hazel received a proper Liip branding shower and a cute-effect facelift</p> <img src="https://liip.rokka.io/www_inarticle/cc4904/bunny3.png" alt="bunny3" format="png"><img src="http://feeds.feedburner.com/~r/liip_blog/~4/dY660Lfr5sU" height="1" width="1" alt=""/> https://www.liip.ch/en/blog/3d-easter-bunny-making-of Mannar : from Aleppo to Bern http://feeds.liip.ch/~r/liip_blog/~3/Z7CgUNDU6aY/mannar-from-aleppo-to-bern https://www.liip.ch/en/blog/mannar-from-aleppo-to-bern Thu, 05 Apr 2018 00:00:00 +0200 <p><em>I met Mannar for the first time last summer in our Bern office. She is one of our trainees from the Powercoders program.<br /> I get the chance to chat with her over a coffee whenever I travel to Bern. It’s time you get to know her too. She is pretty inspiring to me, and we are lucky to have her among us!<br /> Thank you Mannar for sharing your experience and being so sincere!</em><br /> NB: Header picture credit: Marco Zanoni</p> <h2>How did you end up with us?</h2> <p><em>Mannar:</em> The journey to Switzerland in short: from Syria I escaped to Turkey, and then from Turkey by land all the way to Switzerland, using several transportation methods. A truck container was one of them.<br /> I first entered Switzerland in the last days of Dec 2015. I applied for asylum in Canton Waadt, after which I was transferred to Niedergösgen in canton Solothurn, where I still live at the moment. There, through more than one person, I heard about a coding bootcamp for refugees taking place in Bern (called Powercoders). I immediately applied, got accepted, and then went through the 3-month program. Through Powercoders I was introduced to several IT companies in Bern, and LIIP was one of them.</p> <h2>Why Bern?</h2> <p><em>Mannar:</em> I like pandas, so I thought I must go to the city of bears, maybe raising a little panda at home is legal there, haha .. a joke! Because the first bootcamp of Powercoders was in Bern :)</p> <h3>Did you already speak German?</h3> <p><em>Mannar:</em> Nope, I invested the last 2 years in Switzerland in studying it; I’m currently doing a B2 German course. And by the way I’d love to learn French one day, I’ll have to find a sponsor first :)</p> <h3>What is it like to arrive here and apply for asylum?</h3> <p><em>Mannar:</em> It was a new country, a new culture, and it’s natural that one would feel the unknown and overwhelmed all over, especially those who cannot speak a mutual language. Luckily I already spoke English, so it didn’t take me long to orient myself and figure out how things work in Switzerland. For my luck as well, I already knew how to use technology, as I had studied computer engineering back in Syria.
This was a huge plus in a land where almost everything is digitalized.<br /> I have been here for 2 years and I still hold the Permit N, which means I’m still an asylum seeker. The chances that an N holder gets a job are practically zero, and the same applies to getting an apprenticeship. Here Powercoders came in again and pleaded with SEM for an exception to enable me to start my internship at LIIP, and it was successfully issued.</p> <p>Thank you Christian Hirsig, Sunita Asnani, Hannes Gassert, Marco Jakob, and all the teachers who were so kind and patient with our late attendance and, sometimes, poor concentration :)</p> <h2>Why did you apply to Powercoders?</h2> <p><em>Mannar:</em> Back in Syria, after high school I studied computer engineering for 2 years. After graduation I developed my graphic design knowledge and could do some freelance work. As for coding, I studied the basics of programming in college, but never had the chance to go beyond that or to write code myself. Powercoders came at the right time: they offered to teach web development, with all the computers, teachers, projects, support, networking, food and cookies for free, and most importantly a chance for internships later. Who would say no to all of these? I would say yes just for the cookies! Just kidding :)<br /> My goal was to become independent from social help as soon as possible, to get a job (a cool, exciting one that I enjoy doing and can be creative at), and to earn enough money to stand on my own feet.</p> <p>Part of the Powercoders program is to get us (the students who did the bootcamp) the chance to do internships at IT companies in Bern, which I think was the greatest gift ever.<br /> Through a “Career day” where companies met students, I was introduced to several IT companies, and eventually I settled on LIIP. </p> <h3>How did it go?</h3> <p><em>Mannar:</em> At first learning to code was overwhelming: lots of web technologies to learn, my German was not as good as it is now, and I lacked team integration as well.<br /> But then, after the German course was out of the way, my concentration was set and things started to follow a pattern; goals were defined and evaluated weekly with my onboarding coach. During the weeks I was learning and applying what I learnt on small tasks my colleagues gave me. Once a task is finished, it gets evaluated, then I move on to the next one, and so on.<br /> Did you have an onboarding buddy? Yes, I have an onboarding buddy/coach who still helps me on the vocational level and on the integration level. Simone Wegelin is an irreplaceable support during this journey. All of my work colleagues are friendly and generous. They’ve become a second family to me.</p> <h3>Did you feel shy?</h3> <p><em>Mannar:</em> Not shy, rather disconnected at first, especially since I had to attend a German course every morning before work, plus the fact that I travel daily from my village to Bern (1.5 hours). All of it was new and tiring.</p> <h2>How does it go with coding?</h2> <p><em>Mannar:</em> I forgot to mention that I’m learning UX design here as well, not only coding.<br /> Coding is finding solutions to problems, it's cracking puzzles, which luckily I’ve always liked. Yet it was (and still is) challenging to learn to code. There are days when I feel very disappointed with myself for not being able to solve a simple exercise or to style a few elements. Then I’d put myself in this tense state of mind, insisting on solving it, and I fail and push only to fail again, and then I get more angry at myself...
There is a psychological inner battle that comes with it, so it’s not only about learning to code, but also about learning how to deal with failure.<br /> On the other hand, there are those moments when I feel I could be the next female coding star in town! (It’s mostly when I solve a good exercise on Codewars.)</p> <h2>What’s the main difference between now at Liip and your previous job?</h2> <p><em>Mannar:</em> In Syria I was employed at a governmental association, where I basically did data entry and some web content management. It wasn’t exciting for me, rather limited and outdated. I didn’t feel I belonged there. Why did I get into this boring job then, you may ask? Because it was the only job available at that time. I couldn’t take my time to do further education and learn web development, I had to make money fast one way or another.<br /> Here I have to mention that there is no social help back in Syria: if you don’t work, you end up in the street. The government didn’t give a shit. There was no solid health insurance system for employees, and by the end of the month the salary was only enough to cover the basic life requirements; every family member must work and earn to save the ship from drowning. I worked 2 jobs to get a good income.<br /> Syria is an enormously rich country; the textile, wood, iron, plastic and food industries were at their peak, and we have incredible petrol and oil production that could make each and every Syrian citizen live a luxurious life (assuming no corruption and monopolism were there). Yet we lived in indigence and poverty, and that was one of the reasons the poor people went against the dictatorship. But the rich people who didn’t want to change their lifestyle teamed up with the regime against the poor ones to shut them down. However, the political game in the region has played the biggest role. But we’re not going there now :)</p> <p>On the professional level, the main differences between Switzerland and Syria come down to self-discipline, motivation and taking responsibility. Those I truly embrace here, because I learnt them here. Now I wake up every day with excitement to go to work, to learn and to make awesome things, because I’m doing the things I love to do; I’m not forced to do them, I chose them and I enjoy doing them. That was not the case in my previous job in Syria.<br /> I have to mention that in Syria it’s not common for females to work in IT. Most female students get married before or right after graduation, to resign from school life and sign into the housewife position. </p> <h2>What do you miss most?</h2> <p><em>Mannar:</em> I mostly miss privacy. I’ve been living for more than 2 years in a shared house with another family who have 2 babies and lots of visitors, with a shared kitchen and bathroom; I miss the quiet and solitude. I’m not allowed to change my place of residence because of Permit N.<br /> I also miss financial comfort; the Permit N salary is honestly a joke. It changed my lifestyle, as I always must consider every franc I spend. As simple as it might sound to some people, I miss buying things that are not from second-hand shops, buying meals from take-aways, visiting a gym... etc.<br /> I just hope to get the B Permit soon to be able to work and earn well. Permit B would also give me the right to travel and see my family, who I miss the most.</p> <img src="https://liip.rokka.io/www_inarticle/83e1f8/mannar-and-isaline.jpg" alt="mannar-and-isaline"> <p><em>Mannar likes challenges and always has an objective in sight!
:)</em></p> <h2>What’s your next step?</h2> <p><em>Mannar:</em> My next step will be determined by the decision issued by SEM: either a negative one, Permit F (provisionally admitted foreigners), or a positive one, Permit B (resident foreign nationals).<br /> I must consider both possibilities. With the B Permit, things like employment, changing address and family reunification can be achieved smoothly; with F they could be done as well, but as slowly as 5 years minimum.<br /> So there is the highway plan (with B):<br /> The first thing I will do is buy my own postpaid SIM card! Seriously.<br /> Then I would want to do an apprenticeship in informatics at Liip, get a diploma in informatics and apply for a job; only then can I be independent from social help. And rent a place of my own! Finally!</p> <p>And there is the bicycle track plan (with F): to fulfill that plan I must get ready to go through a long process and bureaucratic procedures and expect refusal at any step. One should be patient and keep failing until it works.</p> <p>Thank you Isaline for giving me the chance to be heard (or read). I hope someone from SEM is reading this while he or she is in a good mood. ^_^</p> <p>---------</p> <p><strong>Update:</strong> Since we wrote this blogpost, Mannar has received her Permit B! It means that she is allowed to stay, work, study and rent in Switzerland, as well as to travel. She now has the right to sign a contract (for a mobile phone, for example). She will start her apprenticeship at Liip in August!<br /> <a href="https://www.linkedin.com/in/mannar-hielal/">Follow Mannar on LinkedIn</a><br /> ---------</p> <p>Check out <a href="http://powercoders.org/">Power Coders</a>. Don't hesitate to contact them if you wish to get involved as a coach, if you want to welcome a trainee, or just if you want more information :)<br /> The program is currently starting in Suisse Romande too!</p> <p>---------</p><img src="http://feeds.feedburner.com/~r/liip_blog/~4/Z7CgUNDU6aY" height="1" width="1" alt=""/> https://www.liip.ch/en/blog/mannar-from-aleppo-to-bern