Example Usage¶
In this example, some simple processing is performed on the example.csv
file. Even more code examples are included with the project.
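The contents of example.csv are not reproduced here. As an illustration only (this is an assumption, not necessarily the project's actual test data), a minimal file consistent with the query and output shown below might look like:

```text
a,b,c
1,2,3
```

The query below selects rows where a <= b, groups by a, and takes MIN(b), which for this file yields a = 1, MIN(b) = 2.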
Add DataFusion as a dependency¶
Find the latest available DataFusion version on DataFusion’s
crates.io page, then add the dependency to your Cargo.toml
file:
datafusion = "31"
tokio = "1.0"
Add DataFusion latest codebase as a dependency¶
Cargo supports adding a dependency directly from GitHub, which allows testing the latest DataFusion codebase without waiting for the code to be released to crates.io according to the DataFusion release schedule:
datafusion = { git = "https://github.com/apache/arrow-datafusion", branch = "main"}
This also works at the package level:
datafusion-common = { git = "https://github.com/apache/arrow-datafusion", branch = "main", package = "datafusion-common"}
And with features:
datafusion = { git = "https://github.com/apache/arrow-datafusion", branch = "main", default-features = false, features = ["unicode_expressions"] }
See the Cargo Book for more on specifying dependencies.
Run a SQL query against data stored in a CSV:¶
use datafusion::prelude::*;
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
// register the table
let ctx = SessionContext::new();
ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;
// create a plan to run a SQL query
let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;
// execute and print results
df.show().await?;
Ok(())
}
Use the DataFrame API to process data stored in a CSV:¶
use datafusion::prelude::*;
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
// create the dataframe
let ctx = SessionContext::new();
let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
let df = df.filter(col("a").lt_eq(col("b")))?
.aggregate(vec![col("a")], vec![min(col("b"))])?
.limit(0, Some(100))?;
// execute and print results
df.show().await?;
Ok(())
}
Output from both examples¶
+---+--------+
| a | MIN(b) |
+---+--------+
| 1 | 2 |
+---+--------+
Arrow Versions¶
Many of DataFusion’s public APIs use types from the arrow and parquet crates, so if you use
arrow in your project, the arrow version must match that used by DataFusion. You can check
the required version on DataFusion’s crates.io page.
The easiest way to ensure the versions match is to use the arrow exported by DataFusion, for example:
use datafusion::arrow::datatypes::Schema;
For example, the DataFusion 25.0.0 dependencies require arrow 39.0.0. If instead you used
arrow 40.0.0 in your project you may see errors such as:
mismatched types [E0308] expected `Schema`, found `arrow_schema::Schema`
Note: `arrow_schema::Schema` and `Schema` have similar names, but are actually distinct types
Note: `arrow_schema::Schema` is defined in crate `arrow_schema`
Note: `Schema` is defined in crate `arrow_schema`
Note: perhaps two different versions of crate `arrow_schema` are being used?
Note: associated function defined here
Or calling downcast_ref on an ArrayRef may return None unexpectedly.
Identifiers and Capitalization¶
Please be aware that all identifiers are effectively made lower-case in SQL, so if your CSV file has capital letters (e.g. Name) you must put your column name in double quotes or the examples won’t work.
To illustrate this behavior, consider the capitalized_example.csv file.
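The contents of capitalized_example.csv are not shown here. As an illustration only (an assumed file, not necessarily the project's actual test data), a file with columns A, b, and c that is consistent with the queries and output below might look like:

```text
A,b,c
1,2,3
2,1,4
2,5,4
```

With the filter "A" <= c, grouping by "A" and taking MIN(b) gives A = 2, MIN(b) = 1 and A = 1, MIN(b) = 2, matching the output shown after the examples.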
Run a SQL query against data stored in a CSV:¶
use datafusion::prelude::*;
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
// register the table
let ctx = SessionContext::new();
ctx.register_csv("example", "tests/data/capitalized_example.csv", CsvReadOptions::new()).await?;
// create a plan to run a SQL query
let df = ctx.sql("SELECT \"A\", MIN(b) FROM example WHERE \"A\" <= c GROUP BY \"A\" LIMIT 100").await?;
// execute and print results
df.show().await?;
Ok(())
}
Use the DataFrame API to process data stored in a CSV:¶
use datafusion::prelude::*;
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
// create the dataframe
let ctx = SessionContext::new();
let df = ctx.read_csv("tests/data/capitalized_example.csv", CsvReadOptions::new()).await?;
let df = df
// col will parse the input string, hence requiring double quotes to maintain the capitalization
.filter(col("\"A\"").lt_eq(col("c")))?
// alternatively use ident to pass in an unqualified column name directly without parsing
.aggregate(vec![ident("A")], vec![min(col("b"))])?
.limit(0, Some(100))?;
// execute and print results
df.show().await?;
Ok(())
}
Output from both examples¶
+---+--------+
| A | MIN(b) |
+---+--------+
| 2 | 1 |
| 1 | 2 |
+---+--------+
Extensibility¶
DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
User Defined Functions (UDFs)
User Defined Aggregate Functions (UDAFs)
User Defined Table Sources (TableProvider) for tables
User Defined Optimizer passes (plan rewrites)
User Defined LogicalPlan nodes
User Defined ExecutionPlan nodes
Optimized Configuration¶
For an optimized build several steps are required. First, add the following to your Cargo.toml
file. It is worth noting that the settings in the [profile.release]
section will significantly increase the build time:
[dependencies]
datafusion = { version = "22.0" }
tokio = { version = "^1.0", features = ["rt-multi-thread"] }
snmalloc-rs = "0.3"
[profile.release]
lto = true
codegen-units = 1
Then, in main.rs, set the memory allocator with the below after your imports:
use datafusion::prelude::*;
#[global_allocator]
static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
Ok(())
}
Finally, building with the simd optimization requires the nightly toolchain:
rustup toolchain install nightly
Based on the instruction set architecture you are building on, you will also want to configure target-cpu, ideally with native or at least avx2:
RUSTFLAGS='-C target-cpu=native' cargo +nightly run --release
Enable backtraces¶
By default, DataFusion returns errors as a plain message. There is an option to enable more verbose details about the error,
such as an error backtrace. To enable a backtrace, add the DataFusion backtrace
feature to your Cargo.toml
file:
datafusion = { version = "31.0.0", features = ["backtrace"]}
Then set the RUST_BACKTRACE environment variable when running:
RUST_BACKTRACE=1 ./target/debug/datafusion-cli
DataFusion CLI v31.0.0
❯ select row_numer() over (partition by a order by a) from (select 1 a);
Error during planning: Invalid function 'row_numer'.
Did you mean 'ROW_NUMBER'?
backtrace: 0: std::backtrace_rs::backtrace::libunwind::trace
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: std::backtrace_rs::backtrace::trace_unsynchronized
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: std::backtrace::Backtrace::create
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:332:13
3: std::backtrace::Backtrace::capture
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:298:9
4: datafusion_common::error::DataFusionError::get_back_trace
at /arrow-datafusion/datafusion/common/src/error.rs:436:30
5: datafusion_sql::expr::function::<impl datafusion_sql::planner::SqlToRel<S>>::sql_function_to_expr
............
Note: the backtrace is wrapped in system calls, so some frames at the top of the backtrace can be ignored.