BigData London 2022

Andrei Zhozhin

| 8 minutes

This year I was able to attend both days of the conference and managed to speak with many exhibitors (in 2021 I had only a single day), so I'm sharing a quick overview of different tech solutions that might be interesting to explore in more detail. These include non-relational databases, developer tools, tools for data science, and some auxiliary technologies and solutions that could help with data management. It was a great time and I learned a lot of new things in a pretty short time frame.

Databases

My main interest was to explore different kinds of specialized databases that could help to solve business problems where relational databases cannot handle complex data models.

InfluxDB (Time-series)

InfluxDB logo

This is a mature (initially released in 2013) open-source time series database written in Go.

It is a very interesting technology that can be used to implement data gathering, monitoring, and alerting with a single solution. The open-source version is rock solid and can serve as the core storage for time series data.

Supported data structures are measurements, series, and points. Each point consists of several key-value pairs called the field set, plus a timestamp. Values can be 64-bit integers, 64-bit floats, strings, and booleans.

The Influx line protocol is very simple and accepts data via HTTP, TCP, and UDP:

measurement \
  (,tag_key=tag_val)* \
  field_key=field_val(,field_key_n=field_value_n)*  \
  (nanoseconds-timestamp)?

For example:

weather,location=us-midwest temperature=82 1465839830100400200
  |    -------------------- --------------  |
  |             |             |             |
  |             |             |             |
+-----------+--------+-+---------+-+---------+
|measurement|,tag_set| |field_set| |timestamp|
+-----------+--------+-+---------+-+---------+
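
As a rough illustration of the HTTP ingestion path, the example point above can be written to a local InfluxDB 1.x instance with a plain POST request (a minimal sketch; the host, port, and database name are assumptions):

# Write one line-protocol point to a local InfluxDB 1.x instance over HTTP.
# The host, port, and database name ("weatherdb") are assumptions for illustration.
import requests

line = "weather,location=us-midwest temperature=82 1465839830100400200"

resp = requests.post(
    "http://localhost:8086/write",
    params={"db": "weatherdb", "precision": "ns"},
    data=line,
)
resp.raise_for_status()  # InfluxDB answers a successful write with HTTP 204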

TICK stack

The TICK stack is a set of technologies that allows capturing, storing, processing, and visualizing time series data. It can be compared to the Elastic Stack (originally the ELK stack: Elasticsearch + Logstash + Kibana), but aimed primarily at time series data.

It consists of the following parts:

  • Telegraf - data gathering capabilities
  • InfluxDB - time series data store
  • Chronograf - real-time visualization
  • Kapacitor - monitoring (with anomaly detection) and alerting capabilities

Influx TICK stack

I have practical experience building an in-house specialized monitoring system for infrastructure and application monitoring using InfluxDB 1.x. At the time, the TICK stack was not very mature, so I had to implement several components (agent-less metrics gathering and data visualizations) myself to make the whole thing work. Now the TICK stack is mature enough that it can be configured to achieve almost the same functionality without writing a lot of custom code.

QuestDB (Time-series + Relational)

QuestDB logo

QuestDB is a fast SQL database for time series. Its main feature is data processing speed: in some use cases it can be much faster than InfluxDB, as the developers have focused on performance from day one.

Interestingly enough, QuestDB supports the InfluxDB line protocol, so potentially one store can be replaced with the other without any problems.

In addition to time series, QuestDB also supports relational data modeling, so it is possible to combine the two data models with SQL JOINs, which might be a killer feature.

QuestDB time-series to relational JOIN

QuestDB uses a columnar data structure and sorts data in memory before writing it to the filesystem, which allows very performant reads and writes.

Here is a query language comparison between Flux (InfluxDB's data scripting language) and QuestDB. First, Flux:

from(bucket:"example-bucket")
  |> range(start:-1h)
  |> filter(fn:(r) =>
    r._measurement == "cpu" and
    r.cpu == "cpuTotal"
  )
  |> aggregateWindow(every: 1m, fn: mean)

And the same query in QuestDB (its SQL-like syntax is more familiar to data analysts):

SELECT avg(cpu), avg(cpuTotal) FROM 'example-bucket'
WHERE timestamp > dateadd('h', -1, now())
SAMPLE BY 1m
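
The same query can also be issued over QuestDB's HTTP REST API, which exposes an /exec endpoint (a minimal sketch; the host, port, and table name are assumptions carried over from the example above):

# Run the SAMPLE BY query against QuestDB's HTTP REST API (port 9000 by default).
# The table name 'example-bucket' comes from the example above and is illustrative.
import requests

query = """
SELECT avg(cpu), avg(cpuTotal) FROM 'example-bucket'
WHERE timestamp > dateadd('h', -1, now())
SAMPLE BY 1m
"""

resp = requests.get("http://localhost:9000/exec", params={"query": query})
resp.raise_for_status()
result = resp.json()            # JSON object with "columns" and "dataset"
for row in result["dataset"]:
    print(row)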

ArangoDB (key-value + document + graph)

ArangoDB logo

ArangoDB is a multi-model database (key-value, document, graph) that includes a full-text search and ranking engine. The database comes with the ArangoDB Query Language (AQL), a declarative query language that works across all data models and supports collection traversal, JOINs, search, and geospatial queries.

Example of document merging between collections

Inserting a document into the Characters collection:

INSERT {
    "name": "Ned",
    "surname": "Stark",
    "alive": true,
    "age": 41,
    "traits": ["A","H","C","N","P"]
} INTO Characters

Inserting multiple documents into the Traits collection:

LET data = [
    { "_key": "A", "en": "strong", "de": "stark" },
    ...
    { "_key": "C", "en": "loyal", "de": "loyal" },
    ...
    { "_key": "H", "en": "powerful", "de": "einflussreich" },
    ...
    { "_key": "N", "en": "rational", "de": "rational" },
    ...
    { "_key": "P", "en": "brave", "de": "mutig" },
    ...
]

FOR d in data 
    INSERT d INTO Traits

Joining the Characters and Traits collections on the c.traits property:

FOR c IN Characters
    RETURN MERGE(c, { traits: DOCUMENT("Traits", c.traits)[*].en } )

Result:

[
  {
    "_id": "Characters/2861650",
    "_key": "2861650",
    "_rev": "_V1bzsXa---",
    "age": 41,
    "alive": false,
    "name": "Ned",
    "surname": "Stark",
    "traits": [
      "strong",
      "powerful",
      "loyal",
      "rational",
      "brave"
    ]
  },
]

AQL looks very promising and feels much nicer than MongoDB's JSON query syntax with its $lookup operator for joins.
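
For completeness, the same MERGE query could be executed from Python with the python-arango driver (a minimal sketch; the host, credentials, and database name are assumptions):

# Run the MERGE query above using the python-arango driver.
# The host, credentials, and database name are assumptions for illustration.
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="passwd")

aql = """
FOR c IN Characters
    RETURN MERGE(c, { traits: DOCUMENT("Traits", c.traits)[*].en })
"""

# db.aql.execute() returns a cursor over the resulting documents
for doc in db.aql.execute(aql):
    print(doc["name"], doc["traits"])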

MemGraph (graph)

MemGraph logo

MemGraph is an in-memory graph database written in C++ for real-time streaming data (released in 2017). It supports the Cypher graph query language for working with data.

The Cypher query language operates on nodes (vertices) and relationships (edges); the basic operations are:

  • MATCH () for matching nodes
  • MATCH ()-[]->() for matching relationships
  • WHERE for filtering results by using various conditions
  • RETURN for projecting results

For example:

MATCH (n:Character)-[e:KILLED]->(m:Character)
WHERE n.name = "Jon Snow" AND e.method != "Knife"
RETURN n, e, m;

Cypher query result

MemGraph supports developing custom functionality (query modules) in C++, Rust, and Python.
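
As a rough illustration of what a Python query module might look like (a minimal sketch; the procedure name and the Character/KILLED data model are assumptions reused from the Cypher example above):

# A hypothetical MemGraph query module in Python.
# Saved into MemGraph's query modules directory, the procedure becomes
# callable from Cypher, e.g. CALL <module_name>.count_kills("Jon Snow") YIELD kills;
import mgp


@mgp.read_proc
def count_kills(ctx: mgp.ProcCtx, name: str) -> mgp.Record(kills=int):
    """Count outgoing :KILLED relationships of the character with the given name."""
    kills = 0
    for vertex in ctx.graph.vertices:
        if vertex.properties.get("name") == name:
            kills = sum(1 for edge in vertex.out_edges if edge.type.name == "KILLED")
    return mgp.Record(kills=kills)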

Neo4j (graph)

Neo4j logo

Neo4j is a mature (released in 2007) graph database written in Java that can be scaled in different ways to support different business requirements. It uses on-disk storage for graph data by default and thus can theoretically support huge graphs.

Neo4j also supports the Cypher query language.

Neo4j can be considered the default solution to start with for graph-oriented applications.

Neo4j graph

Key features:

  • Standard graph language CQL - Cypher Query Language
  • Graph Data Model
  • Indexes support (using Apache Lucene)
  • Built-in UI - Neo4j Data Browser
  • Full ACID (Atomicity, Consistency, Isolation, Durability) support
  • REST API for integrations
  • Cypher API and Java API (see the sketch after this list)
  • Potentially unlimited scalability
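
As referenced in the feature list, here is a minimal sketch of running Cypher through the official Neo4j Python driver (the connection URI, credentials, and the Character/KILLED toy model reused from the earlier Cypher example are assumptions):

# Run a Cypher query through the official Neo4j Python driver.
# URI, credentials, and the data model are assumptions for illustration.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

cypher = """
MATCH (n:Character)-[e:KILLED]->(m:Character)
WHERE n.name = $name
RETURN n.name AS killer, m.name AS victim
"""

with driver.session() as session:
    for record in session.run(cypher, name="Jon Snow"):
        print(record["killer"], "->", record["victim"])

driver.close()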

Data Analyst Tools

I have a long history of using JetBrains products for JavaScript (WebStorm), Ruby (RubyMine), Python (PyCharm), Java (IntelliJ IDEA), and C# (Rider and ReSharper). This time I was interested in looking at tools and solutions for data engineers.

Their products are well known within the development community.

I've used multiple database tools throughout my career: phpMyAdmin (MySQL), SQL Server Management Studio (MS SQL Server), PL/SQL Developer (Oracle), Toad (Oracle), SQL Developer (Oracle), and DbVisualizer (any relational database via JDBC).

But I've found that DataGrip is the most convenient and flexible. It supports code navigation, schema exploration, and query explain plans. If you are working with other JetBrains products, it is a great complement that supports your productivity. The majority of keyboard shortcuts are the same across all products, so it feels very natural when you switch between your IDE and DataGrip. My second favorite relational database tool is DbVisualizer.

JetBrains DataGrip

DataGrip logo

JetBrains DataGrip is a database IDE for professional SQL developers. It is cross-platform and suitable for DBAs and developers working with relational databases, and it can work with any relational database via the JDBC interface.

DataGrip query console

Key features:

  • Intelligent query console
  • Efficient schema navigation
  • Explain plan (with visualization)
  • Smart code completion
  • Refactorings for SQL and schemas
  • PL/SQL Debugger support for Oracle

JetBrains DataSpell

DataSpell logo

DataSpell is an IDE created specifically for data scientists. It supports the Jupyter family: local Jupyter notebooks, JupyterHub, and JupyterLab. It feels like a Jupyter notebook, but with proper IntelliSense and code navigation.

DataSpell

Key features:

  • Intelligent code navigation
  • Smart code completion
  • Refactorings
  • Debugger support
  • Version control
  • Database tools to explore data, run queries, and alter schemas

If you concentrate more on development, then PyCharm could be the tool to choose, while if you focus more on data science, DataSpell is the right tool.

JetBrains DataLore

DataLore logo

DataLore is a browser-based collaboration platform for data science. It looks similar to Databricks Notebooks but is not bound to any particular execution environment (you can choose whatever you want). DataLore also offers the much better developer experience you would expect from JetBrains products, while Databricks Notebooks has only basic code completion support.

DataLore

Key features:

  • Multiple data integrations
  • Smart coding assistance (Python, SQL, Kotlin, Scala, R)
  • No-code automation
  • Environment manager (isolated or shared environments)
  • Automatic visualizations
  • Collaboration everywhere (notebooks, data, scripts, environments, reports)
  • Built-in version control
  • Live collaborative coding
  • BI apps (create interactive apps from notebooks)

Data Integration tools

Fivetran (ELT)

Fivetran logo

Fivetran is a commercial solution offering fully automated connectors for syncing data from cloud applications, databases, event logs, and more into your data warehouse. It is a closed-source, mature system offering more than 150 connectors for almost any data source you might have.

Fivetran ELT

Key features:

  • mature commercial solution
  • 150+ connectors
  • major warehouses and databases as destinations
  • supports post-load transformations via SQL and dbt
  • customizations are possible with cloud functions on serverless platforms (Go, Java, Node.js, C#, F#)

Airbyte (ELT)

Airbyte logo

Airbyte is an open-source data integration engine that helps you consolidate your data in your data warehouses, lakes, and databases. As it is open source and has a strong community, the number of integrations is growing rapidly (200+ connectors within two years of inception). If required, you can change existing connectors or create new ones very quickly.

Airbyte ELT

It supports integration with Airflow, so it is possible to trigger Airbyte ELT jobs from Airflow.
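
A minimal sketch of such a trigger using the official Airflow Airbyte provider (the connection IDs and schedule are assumptions):

# Trigger an Airbyte sync from an Airflow DAG via the Airbyte provider.
# The Airflow connection id and the Airbyte connection id are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="trigger_airbyte_sync",
    start_date=datetime(2022, 9, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    sync = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",             # Airflow connection to the Airbyte API
        connection_id="<your-airbyte-connection-id>",  # Airbyte source -> destination connection
    )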

Key features:

  • strong open source community
  • 200+ connectors (fully open source), 50% contributed by the community
  • ability to change an existing connector
  • ability to create new connectors using SDK
  • connector certification and SLAs
  • custom post-load transformations with dbt
  • can be hosted anywhere: on-prem or in the cloud

This solution looks very promising and is already well accepted by businesses around the world (more than 25,000 companies).

Summary

These were a great two days during which I managed to talk to many people and learn about their products. I'd like to thank the conference organizers and the exhibitors for such an awesome experience. I recommend that everyone visit, speak to the experts, and listen to the presentations.

You can also check my previous article about Big Data London 2021.
