Populating Cloud Data Warehouses with Apache NiFi


Constantin, a Master Data Engineer pointed out that JSON, Parquet or Avro are the most used semi-structured data types, but some Enterprise Applications still produce XML!  Parsing XML is expensive,  Ingesting XML is challenging and Large XML files are hard to chunk for parallel ingest.  We still need this data and don't want to write custom code to parse it.   So let's break it down and ingest this data at scale.

Source:   https://github.com/tspannhw/clouddatawarehouse

Ingest Raw Data:

<?xml version='1.0' encoding='utf-8'?><rss xmlns:dc='http://purl.org/dc/elements/1.1/' version='2.0'><channel><title>travel.state.gov: Travel Advisories</title><link>travel.state.gov: Travel Advisories</link><description>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories.html</description><item><title>Bhutan - Level 1: Exercise Normal Precautions</title><pubDate>Fri, 25 Jan 2019</pubDate><link>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories/bhutan-travel-advisory.html</link><guid>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories/bhutan-travel-advisory.html</guid><dc:identifier>BT</dc:identifier><description><![CDATA[<p>Exercise normal precautions in Bhutan.&nbsp;</p>]]></description></item>

Apache NiFi Flow to Read US Travel Advisors XML

Let's Ingest Those Advisors For Fast Queries and ML Analysis

Any Data, Anywhere, Anytime

With Apache NiFi you can populate Cloud Data Warehouses with ease directly through JDBC or to S3 buckets from sources including email, REST, Logs, Databases, NoSQL stores, Hadoop, XML, PDF, Word, AVRO, JSON, CSV, Text, Syslog and more.   For this example I am grabbing XML/RSS feeds from public sources on travel advisors around the world for US citizens.   It is easy to route, cleanse, transform and store at scale in our hyperscale Cloud Data Warehouse regardless of cloud.

Source URL:   https://travel.state.gov/_res/rss/TAsTWs.xml

Table Design

Let's create some tables. You should design an ER schema and use either 3NF, Snowflake Schema, Star Schema or Data Vault. Here are two simple tables to contain all the fields we have. We may want to add a primary key and date / timestamp to our travelwarning table as well.

CREATE TABLE TRAVELWARNING ( title VARCHAR,  pubDate VARCHAR,  link VARCHAR,  guid VARCHAR,  identifier VARCHAR,  description VARCHAR );


Connecting to A Cloud Data Warehouse
We just need a JDBC Driver, Driver class name, URL, username, password and download JAR.   Now we can connect for queries or inserts with ease at full speed.   Make sure you download the latest version of the JDBC driver you need and store it somewhere that the Apache NiFi user can use it.  I recommend a special directory for these drivers.  
We can easily use record processing to ingest 1,000s of records and convert them to useful types like JSON or AVRO before inserting them into our data warehouse.
Let's insert rows into our Cloud Data Warehouse on AWS as raw JSON into a VARIANT column using PutSQL.

We can also use Records to easily infer the schema and push to our our regularly defined table using PutDatabaseRecord.

We can now query the data as it is arriving.

Other Ingest Options

We can also land the JSON data in S3 and have that ingested via Snowpipe or bulk load with COPY

Examples of Other Data We Can Ingest

Useful Cloud Data Warehouse Resources