Thursday, July 5, 2018

Import Fixed Width data file into PostgreSQL

As stated in my previous post, loading data file into PostgreSQL is easy with the COPY command.

But, the problem I faced is with loading a data file which has no delimiter!!! Meaning the data file that I wanted to load is a Fixed width format. There is no way in PostgreSQL to load a file without the delimiter. It was easy in Oracle SQLLoader :) but not with PostgreSQL.

There is one way you can load the data by following this link. It uses a very simple technique of:
1] Create a temp table
2] Load the data as is in a single column
3] And then insert into (CTAS) a table by parsing that single column in a select statement to get the fixed width data out by using substr.

Very easy if you have a few number of records to be loaded.

In my case, the data file that I wanted to load has about 250 million records and the above solution is a big NO.

So, this is how I approached to load the data. Very simple :)

The main problem/issue is with Fixed width. So, what if, if I can convert that fixed width into a delimited file??? problem solved. I can use the regular COPY command to load that monster file !!!

First, you have to create the table with all the columns. Since it is fixed width, you will have a dictionary specifying the column and either the start and end position or the length for that column. Use that dictionary and create a table.

Once the table is created, get the length of each column in a single row. I used vim to bring all the rows into a single row (vim command :%j)

After you have all the column widths in a single row use the following gawk command to do the magic of converting converting the fixed width file into a delimited format.

$ gawk '$1=$1' FIELDWIDTHS='10 10 10 5 4 2 3 7 3 10 2 28 4 2 6 8 28 13 2 25 30 30 3 2 4 9 10 1 1 1 1 6 15 1 20 6 1 6 3 2 2 2 2 2 2 1 1 1 1 1 2 2 1 1 8 2 1 1 1 1 1 2 2 3 2 3 3 14 3 1 2 1 1 1 1 2 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 17 3 1 1 1 1 1 7 1 4 4 4 2 3 3 1 2 1 1 1 1 1 1 4 1 1 4 1 3 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 8 4 4 4 4 4 1 1 1 1 8 3 4 3 8 3 4 3 8 3 4 3 8 3 4 3 8 3 4 3 8 3 4 3 8 3 4 3 8 3 4 3 8 3 4 3 8 3 4 3 3 4 3 4 3 4 3 4 3 4 3 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1' OFS='|' fixed_with_data_file.txt > delimited_data_file.txt

That's it. Now you have a file with the delimiter and you can use COPY command to load into your table.

Oh BTW, I used OFS='|' in the command which indicates my preferred delimiter is a pipe. If you want the regular csv then use OFS=, (no need of single quotes around , )


No comments:

Post a Comment