Apache Arrow 6 Improves Support For R and Rust – iProgrammer

Apache Arrow 6 has been released with improvements to support for R and Rust as well as Arrow Flight. There’s also new support for DataFusion.

Apache Arrow is a development platform for in-memory analytics. It has technologies that enable big data systems to process and move data fast. It is language independent, can be used for flat and hierarchical data, and the data store is organized for efficient analytic operations. It also provides computational libraries. Languages currently supported are C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.

The improvements in the new release start with the addition of bindings for Flight in GLib and Ruby. The team says that while SQL support for Flight hasn’t made it into this release, work is ongoing. Arrow Flight SQL defines a protocol for clients to communicate with SQL databases using Arrow Flight.

In Arrow’s compute layer, a basic in-memory query engine has been implemented and is accessible from the R bindings. The query engine supports operations including filter, project, sort, equality joins, and various aggregations. A wide range of functions have also been added in this version, and type support has been improved for most of the compute functions.
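To make the operations listed above concrete, here is a minimal plain-Python sketch of filter, project, sort and a grouped sum over column-oriented data. It does not use Arrow or its API; the toy table, the helper names (`filter_rows`, `project`, `sort_by`, `sum_by`) and the data are all assumptions made up for illustration.

```python
from collections import defaultdict

# A toy "table": a dict mapping column names to equal-length lists,
# loosely mirroring a column-oriented layout. The data is invented.
table = {
    "city": ["Oslo", "Paris", "Oslo", "Rome", "Paris"],
    "sales": [10, 25, 5, 40, 15],
    "year": [2020, 2021, 2021, 2021, 2020],
}

def filter_rows(tbl, column, predicate):
    """Keep the rows whose value in `column` satisfies `predicate`."""
    keep = [i for i, v in enumerate(tbl[column]) if predicate(v)]
    return {name: [values[i] for i in keep] for name, values in tbl.items()}

def project(tbl, columns):
    """Keep only the named columns."""
    return {name: tbl[name] for name in columns}

def sort_by(tbl, column):
    """Reorder every column by the ascending order of `column`."""
    order = sorted(range(len(tbl[column])), key=lambda i: tbl[column][i])
    return {name: [values[i] for i in order] for name, values in tbl.items()}

def sum_by(tbl, key, value):
    """Grouped aggregation: sum `value` for each distinct `key`."""
    totals = defaultdict(int)
    for k, v in zip(tbl[key], tbl[value]):
        totals[k] += v
    return dict(totals)

# Chain the operations the way a small query plan would.
recent = filter_rows(table, "year", lambda y: y == 2021)
ordered = sort_by(project(recent, ["city", "sales"]), "sales")
totals = sum_by(table, "city", "sales")
```

A real engine works on typed, contiguous buffers rather than Python lists, but the shape of the plan (filter, then project, then sort or aggregate) is the same.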

The support for R has been enhanced with a number of major new features in this version, some of which the team has been building up to for several years. In practical terms, there’s more dplyr support, including the ability to carry out grouped aggregation. You can now summarise() on Arrow data, with or without group_by(). This works both with in-memory Arrow tables and across partitioned datasets. Most common aggregation functions are supported. In addition to aggregation, Arrow now also supports all of dplyr’s mutating joins (inner, left, right, and full) and filtering joins (semi and anti).
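The difference between mutating and filtering joins can be sketched in a few lines of plain Python. This is only an illustration of the semantics, not dplyr's or Arrow's implementation; the function names and sample rows are hypothetical. A mutating join adds columns from the right-hand table, while a filtering join only decides which left-hand rows to keep.

```python
def inner_join(left, right, key):
    """Mutating join: combined rows where the key matches on both sides."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

def left_join(left, right, key):
    """Mutating join: every left row, with None where the right has no match."""
    right_cols = {c for r in right for c in r if c != key}
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append({**l, **{c: None for c in right_cols}})
    return out

def semi_join(left, right, key):
    """Filtering join: left rows that have a match; no columns are added."""
    keys = {r[key] for r in right}
    return [l for l in left if l[key] in keys]

def anti_join(left, right, key):
    """Filtering join: left rows that have no match on the right."""
    keys = {r[key] for r in right}
    return [l for l in left if l[key] not in keys]

# Invented sample data.
orders = [{"id": 1, "item": "pen"}, {"id": 2, "item": "ink"}, {"id": 3, "item": "pad"}]
prices = [{"id": 1, "price": 2.5}, {"id": 3, "price": 4.0}, {"id": 4, "price": 9.0}]

matched = inner_join(orders, prices, "id")
all_orders = left_join(orders, prices, "id")
priced = semi_join(orders, prices, "id")
unpriced = anti_join(orders, prices, "id")
```

In this sketch a right join is a left join with the two tables swapped, and a full join is a left join plus the unmatched right-hand rows.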

The R team has also added support for DuckDB as a way to query Arrow Datasets. This means you can use duckdb’s dbplyr methods, as well as its SQL interface, to aggregate data.

Alongside the R improvements, there’s new support for DataFusion. This is an embedded query engine that uses Rust and Apache Arrow to provide a system that the developers say is high performance, easy to connect, easy to embed, and high quality. This release includes a runtime operator metrics collection framework, and an object store abstraction for unified access to local or remote storage. The framework includes Hive-style table partitioning support for Parquet, CSV, Avro and JSON files, and DataFrame API support for except, intersect, show, limit and window functions. It also has extensive SQL support, and now passes TPC-H queries 8, 13 and 21.
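Hive-style partitioning encodes column values in directory names of the form key=value, so a reader can skip whole directories that a query does not need. The following is a rough plain-Python sketch of that idea, not DataFusion's code; the helper names `parse_hive_partitions` and `prune` and the file paths are assumptions for illustration.

```python
from pathlib import PurePosixPath

def parse_hive_partitions(path):
    """Extract key=value directory segments from a Hive-style path."""
    partitions = {}
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:  # ignore plain directories such as "sales"
            partitions[key] = value
    return partitions

def prune(paths, **wanted):
    """Keep only files whose partition values match every requested pair."""
    return [
        p for p in paths
        if all(parse_hive_partitions(p).get(k) == str(v) for k, v in wanted.items())
    ]

# Hypothetical dataset layout.
files = [
    "sales/year=2020/month=12/part-0.parquet",
    "sales/year=2021/month=10/part-0.parquet",
    "sales/year=2021/month=11/part-0.parquet",
]

november = parse_hive_partitions(files[2])
in_2021 = prune(files, year=2021)
only_november = prune(files, year=2021, month=11)
```

Partition values are compared as strings here because they come from path text; a real engine would also infer or be told the column types.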

Apache Arrow 6 is available for download.

More Information

Apache Arrow Website

Arrow On GitHub
