BigData London

Andrei Zhozhin

6 minutes

I first visited Big Data London in 2018. Every time, it is a very interesting experience to see all the major vendors in one place and have a chance to talk to them. The main value for me personally is the ability to talk to experts, ask tricky questions, and receive quick feedback from the engineers who might have developed the products they are presenting. This was the first conference I had attended in person in 1.5 years (since the start of the pandemic), and it was a great experience.

The organizers did a great job of ensuring safety: all participants were required to provide evidence of double COVID-19 vaccination or the result of a recent lateral flow test to be allowed to enter the venue.

I’ll describe some of the products/solutions here; in reality, I met more people and asked more questions :) Unfortunately, I was able to attend only one of the two days, so my reflection will not be very long.

Products

I will cover the aspects that are interesting to me as an engineer; I don’t want to go through every feature and advertise them. Also, please note that I’m not an expert in all of these tools: some of them I’ve only discovered recently, some I’ve used occasionally, and some I use on a daily basis. As I have an engineering background, I automatically think about how I would implement something similar with C#/Java/Python/Javascript.

Denodo

denodo logo

I like the general idea of the product:

  • connect to any data source
  • combine any data
  • consume (using reports, dashboards, portals, …)

Technically, it breaks down the barriers between distributed datasets and the business users who need a holistic view of what is going on, without building a spaceship (something we developers like very much).

I was very curious how the engineers had implemented a query engine that supports multiple backends (relational databases, NoSQL databases, cloud data lakes, applications, and files). I had a pretty deep technical session with an engineer who gave a great demo of the tooling around query planning and caching. I’m a big fan of self-service capabilities for developers: I like how it works in Oracle, where a developer can query all the info they need using SQL itself (the v$ views). I was wondering whether something similar exists in the data virtualization layer to allow development teams to investigate performance issues and fine-tune queries by themselves (without involving a database administrator). All of this functionality exists and is available out of the box via the DESC QUERYPLAN <query> statement, or via the graphical tool (see the Denodo query plan screenshot).
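As a rough illustration of how that self-service workflow could look from a developer’s machine, here is a minimal Python sketch that asks the engine for a plan over a plain ODBC connection. The DSN name, credentials, and the customer_sales view are assumptions made up for the example; only the DESC QUERYPLAN statement itself comes from the demo described above.

```python
# A minimal sketch, assuming a Denodo ODBC data source is configured locally.
# The DSN name, credentials, and the customer_sales view are hypothetical;
# only the DESC QUERYPLAN statement comes from the product demo.
import pyodbc

conn = pyodbc.connect("DSN=denodo_vdp;UID=analyst;PWD=secret")  # hypothetical DSN
cursor = conn.cursor()

# Ask the engine to describe how it would execute the query, instead of running it.
cursor.execute(
    "DESC QUERYPLAN SELECT * FROM customer_sales WHERE region = 'EMEA'"
)

# Each returned row is one node/line of the plan produced by the engine.
for row in cursor.fetchall():
    print(row)

conn.close()
```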

At this point, I was pretty intrigued by the Denodo query engine itself, and I would like to learn more about how SQL engines work in general (SQLite looks like a good example to start with).

Another aspect I was interested in is monitoring: Denodo provides a JMX interface exposing information about the server, data sources, caches, etc., so you can plug in your favourite monitoring tools and observe what is happening with a Denodo instance.

Alteryx

alteryx logo

I’m not a big fan of visual programming tools, but I can see the benefits when business users can define data processing themselves, as it saves a lot of time and provides instant feedback. On the other hand, when things get more complicated over time, visual programming tools cannot contain the accidental complexity, and all the diagrams end up looking like this (see the Alteryx complex workflow screenshot).

So my personal perception is that Excel is the best tool for business users, but when complexity grows, such solutions should be rewritten with an appropriate set of technologies to keep complexity and performance under control. Unfortunately, you cannot easily do data processing in Excel, which is why we have so many tools trying to cover this gap.

Other interesting capabilities (for me as an engineer):

  • Collaboration - it still does not allow multiple people to work on the same workflow at the same time
  • Visual Diff - a great visual representation of the differences between two versions of a workflow
  • Data source definitions on Alteryx Server - all data source connections are configured in a single place, so users don’t need to bother with credentials

I’ve used Alteryx for relatively simple workflows and found it acceptable for such use cases, but for complex ones I would rather write Python scripts with tests, as that gives me more confidence in the final result and a better experience from a debugging/troubleshooting/performance perspective; a small sketch of what I mean follows below.
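To make that concrete, here is a minimal sketch of the script-plus-test style I have in mind, using pandas; the columns and the business rule are invented purely for illustration.

```python
# A minimal sketch of the "script + tests" approach; the columns and the
# business rule below are made up for illustration.
import pandas as pd


def enrich_orders(orders: pd.DataFrame) -> pd.DataFrame:
    """Compute a gross amount per order and keep only shipped orders."""
    out = orders.copy()
    out["gross_amount"] = out["quantity"] * out["unit_price"]
    return out[out["status"] == "shipped"].reset_index(drop=True)


def test_enrich_orders():
    raw = pd.DataFrame(
        {
            "quantity": [2, 5],
            "unit_price": [10.0, 3.0],
            "status": ["shipped", "cancelled"],
        }
    )
    result = enrich_orders(raw)
    assert list(result["gross_amount"]) == [20.0]
    assert (result["status"] == "shipped").all()


if __name__ == "__main__":
    test_enrich_orders()
    print("ok")
```

A transformation like this is trivial to step through in a debugger, profile, and cover with tests, which is exactly what becomes hard inside a large visual workflow.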

Dataiku

dataiku logo

Dataiku bills itself as “the world’s leading platform for Everyday AI, systemizing the use of data for exceptional business results”. I consider it a visual programming tool with a focus on business users and data engineers. Complex workflows also do not look simple (see the Dataiku complex workflow screenshot).

Other key capabilities (for me as an engineer):

  • Diff - no visual representation, code level only
  • Integrated source control - a very important feature, as you don’t need to manage files yourself: they are automatically checked out and committed back to your repository (with branch support).
  • Collaboration between engineers and data analysts - multiple team members can work on different parts of a pipeline at the same time
  • Packaging flows and deployment to air-gapped environments - once you’ve designed and tested everything in your DEV environment, you can package it all into a versioned artefact and move it to a protected environment (manually or via an API).

I would need to try Dataiku myself, as a presentation does not give a real feel for the tool, but the overall graphical interface looks slicker than Alteryx’s.

Domo

domo logo

A BI and analytics solution that looks very similar to Tableau and Power BI. I’ve worked with both of those tools, and I prefer Power BI, as it covers both ETL and reporting, while Tableau covers only the reporting part. Domo’s features match those of its competitors.

One of the killer features is real-time data visualization; both Tableau and Power BI are snapshot-based (refreshed on a schedule). Power BI does have a capability to display real-time data, but it is very limited and requires writing a backend service that pre-cooks the data, something like the sketch below.
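For context, such a service can be as simple as a loop that pushes freshly computed rows into a Power BI streaming (push) dataset. The push URL and metric names below are placeholders, and the exact payload shape depends on the dataset you define.

```python
# A rough sketch of a "pre-cooking" backend for Power BI real-time tiles:
# periodically compute metrics and push them to a streaming (push) dataset.
# The push URL and metric names are placeholders, not real values.
import time
from datetime import datetime, timezone

import requests

PUSH_URL = "https://api.powerbi.com/beta/<workspace>/datasets/<dataset-id>/rows?key=<key>"


def current_metrics() -> dict:
    # In a real service this would query operational systems, queues, logs, etc.
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "orders_per_minute": 42,
    }


while True:
    # The push endpoint accepts a JSON array of row objects.
    resp = requests.post(PUSH_URL, json=[current_metrics()])
    resp.raise_for_status()
    time.sleep(10)  # refresh the dashboard tile every 10 seconds
```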

Another cool feature is Magic ETL, which lets business users extract/transform/join data from different data sources without writing code.

domo dashboard

Summary

The conference is a great place to discover new things, talk to experts, and meet great people. I highly recommend visiting at least a couple of professional conferences per year to stay up to date with what is happening in the industry and to extend your professional network.
