I hope this clears up much of the confusion about why the Hive connector has its name and what it means for Presto to replace the Hive runtime.
In short, if you have existing data that is managed by a Hive metastore, the data and metadata can remain as they are. You can simply point Presto to the Hive metastore and storage instance and begin running blazingly fast analytics queries over your data without using any Hive services other than the metastore.
Presto S3 Hive Connector
Run a query to move data to MinIO
If you haven't read the gentle introduction to the Hive connector post, please read it first before continuing with this example.
When you start this step, you should see the presto cursor once the startup is complete. It should look like this when it is done:
If you don't see the cursor, please wait a moment for this process to complete.
The first step to understanding the Hive metastore's role in the Hive connector is to run a CTAS (CREATE TABLE AS) query that pushes data from one of the TPC connectors into the hive catalog that points to MinIO. The TPC connectors generate data on the fly so that we can run simple tests like this.
First, run a command to show the catalogs and confirm that the tpch and minio catalogs are present, since these are what we will use in the CTAS query.
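The command itself is not shown above; in Presto's SQL it is presumably the standard SHOW CATALOGS statement. The exact list returned depends on how this environment is configured, so treat the comment below as an expectation rather than guaranteed output.

```sql
-- List every catalog registered with this Presto instance.
-- For this walkthrough we expect the list to include tpch and minio.
SHOW CATALOGS;
```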
You should see that the minio catalog is registered. This is actually a Hive connector configured under the name minio to indicate the underlying storage we are using.
Now, open the MinIO UI and log in using:
Access Key: minio
Secret Key: minio123
Upon logging in, you will see the following screen.
Create a bucket by clicking the (+) button and choosing to create a bucket.
Name the bucket tiny, since the dataset we will be transferring is small.
Back in the terminal, create the minio.tiny schema. This is the first call to the metastore, which saves the schema's S3 location in MinIO.
CREATE SCHEMA minio.tiny WITH (location = 's3a://tiny/');
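As an optional check that is not part of the original steps, the new schema can be listed from the same terminal. This is a small sketch that assumes the CREATE SCHEMA statement above succeeded.

```sql
-- List the schemas in the minio catalog; tiny should now appear
-- alongside built-in schemas such as information_schema.
SHOW SCHEMAS FROM minio;
```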
Now that we have a schema that references the bucket where we store our tables in MinIO, we can create our first table.
Optional: To view the queries you run, open the Presto UI and log in using any username (it doesn't matter since no security is set up).
Move the customer data from the generated tpch tiny dataset into MinIO using a CTAS query. Run the following query and, if you like, watch it running in the Presto UI:
CREATE TABLE minio.tiny.customer
WITH (
  format = 'ORC',
  external_location = 's3a://tiny/customer/'
)
AS SELECT * FROM tpch.tiny.customer;
Go back to the MinIO UI and open the tiny bucket. You will now see a
customer directory generated from that table, and underneath that directory will
be a file whose name is composed of a UUID and a date. This is the ORC file generated
by the Presto runtime, now residing in MinIO.
Now that there is a table in MinIO, you can query this data by running the following.
SELECT * FROM minio.tiny.customer LIMIT 50;
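As an optional sanity check, not part of the original walkthrough, you could compare row counts between the generated source table and the copy now stored in MinIO. If the CTAS query completed successfully, the two counts should match.

```sql
-- Rows in the on-the-fly TPC-H source table.
SELECT count(*) AS source_rows FROM tpch.tiny.customer;

-- Rows in the ORC-backed copy stored in MinIO.
SELECT count(*) AS copied_rows FROM minio.tiny.customer;
```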
So the question now is: how does Presto know where to find the ORC file residing in MinIO when all we specify is the catalog, schema, and table? How does Presto know what columns exist in the ORC file, or even that the file it is retrieving is an ORC file to begin with? Find out more in the next step.