Slow upload speeds from Alteryx to Cloudera Impala? Try this:

Once data is in Impala, queries are ultra fast. However, getting large amounts of data into Cloudera can be surprisingly slow. Loading data into a table over an ODBC connection or uploading a CSV to the file system are simple methods, but there is a much faster way: Avro.


Alteryx can upload Avro files directly into HDFS, where you can quickly load the data directly into an Impala table.


Avro files can be compressed, so the total amount of data you send over the network is much smaller than if you worked with a CSV.

So how can we take advantage of the speed of uploading an Avro file while still being able to automate table creation?


If you are using a file with only a few columns, the easiest method is to create an In-Database connection to Impala and select the write output as ‘Avro’. Alteryx will automate the process for you.


However, there’s a catch. If your table has many columns, you will likely see an error saying ‘Property may not exceed 4,000 characters.’


This refers to the SQL statement Alteryx sends to Impala when it automates the creation of the Avro table.


Avro tables contain the table schema directly in the file. It will look something like this:

{
  "type": "record",
  "name": "Alteryx",
  "fields": [
    {"name": "column1_name", "type": ["null", "string"]},
    {"name": "column2_name", "type": ["null", "double"]},
    …
  ]
}


When Alteryx automates the creation of the Avro table, it passes the entire file schema to Impala as a ‘table property’ (the avro.schema.literal property), which is a string literal. However, Impala has a hard cap on how long that string can be: 4,000 characters.
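To get a feel for when this limit bites, here is a rough sketch (in Python, not Alteryx’s actual code) that builds a schema of the shape shown above from hypothetical column names and measures its length as a string:

```python
import json

def estimate_schema_length(num_columns):
    """Build an Avro schema shaped like the example above, with
    hypothetical names like column1_name, and measure its length
    when serialized to a string literal."""
    schema = {
        "type": "record",
        "name": "Alteryx",
        "fields": [
            {"name": f"column{i}_name", "type": ["null", "string"]}
            for i in range(1, num_columns + 1)
        ],
    }
    return len(json.dumps(schema))

# The exact threshold depends on your column names, but with names
# of this length the schema outgrows 4,000 characters somewhere
# below 100 columns.
for cols in (10, 50, 100):
    print(cols, "columns ->", estimate_schema_length(cols), "characters")
```

In other words, a perfectly ordinary wide table blows past the property limit long before it looks like "big data."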


If you have a table with many columns, which is very common with big data, then you will not be able to use the In-Database method.


But there is another way!*


*There are actually a few ways around this: you can create the schema manually and store it in a separate file on HDFS, or load several narrower tables and join them once they are all in Impala. But there is a much simpler, faster, and more stable way.


To accomplish this in Alteryx:


1. Connect your data to a Block Until Done tool.



2. Connect the first data stream to an output tool configured to send an Avro file directly to the HDFS URL your company is using. Make sure to select ‘enable deflate compression’.







3. Connect the second data stream from the Block Until Done tool to a Sample tool configured to select the first 0 rows. This removes all the data from the stream while keeping the metadata.




4. Connect the Sample tool output to another output tool. Here, use an ODBC connection to create a table in Impala. Even though ODBC connections are by far the slowest, because we have only kept the metadata in the data stream, creating the table will take only seconds.




5. In the ‘Post SQL’ configuration option, we will need to enter a few commands to finish automating the process.


a. First, write a LOAD DATA INPATH statement to move the newly uploaded Avro file into the newly created text table. Remember that the path to the Avro file is case sensitive. This will look like:

LOAD DATA INPATH '/your/file/path/avro_filename.avro' INTO TABLE your_database.your_table;


b. Next, alter the format of the text table to an Avro table. This will look like:

ALTER TABLE your_database.your_table SET FILEFORMAT AVRO;


c. Then, invalidate the metadata of your newly created Avro table. This forces Impala to rebuild the table’s metadata, and because it is now an Avro table, Impala reads the schema from the Avro file itself.

INVALIDATE METADATA your_database.your_table;

d. Finally, as an optional step, insert the data from your Avro table into a Parquet- or text-formatted table. I like to do this because querying compressed Avro tables carries a lot of computation overhead, while Parquet tables do not have this problem. The code would look like:


CREATE TABLE your_database.your_parquet_table STORED AS PARQUET AS SELECT * FROM your_database.your_table;




You can also put additional commands in the Post SQL, such as computing stats for the new table or dropping the original Avro table.
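If you ever want to drive the same sequence from a script rather than an Alteryx output tool, the Post SQL is easy to assemble programmatically. Here is a sketch; the database, table, and path names are the placeholders from the statements above, and you would hand each statement to whatever Impala client (ODBC, impyla, etc.) your environment uses:

```python
def build_post_sql(database, table, avro_path, parquet_table=None):
    """Assemble the Post SQL statements from steps a-d above.
    All names are placeholders; substitute your own."""
    statements = [
        # a. Move the uploaded Avro file into the text table.
        f"LOAD DATA INPATH '{avro_path}' INTO TABLE {database}.{table}",
        # b. Switch the table's format to Avro.
        f"ALTER TABLE {database}.{table} SET FILEFORMAT AVRO",
        # c. Force Impala to re-read the schema from the Avro file.
        f"INVALIDATE METADATA {database}.{table}",
    ]
    if parquet_table:  # d. Optional copy into a Parquet table.
        statements.append(
            f"CREATE TABLE {database}.{parquet_table} STORED AS PARQUET "
            f"AS SELECT * FROM {database}.{table}"
        )
    return statements

for stmt in build_post_sql("your_database", "your_table",
                           "/your/file/path/avro_filename.avro",
                           parquet_table="your_parquet_table"):
    print(stmt + ";")
```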


Until Alteryx introduces support for Parquet files, using Avro files to upload data is a great way to reduce the data transmission time. Setting up an automated workflow to create the Avro table and load it into a Parquet table is simple once you know how to do it.



 © Cleartelligence, Inc. 2018

All Rights Reserved