Thursday, July 5, 2018

Import Fixed Width data file into PostgreSQL

As stated in my previous post, loading data file into PostgreSQL is easy with the COPY command.

But, the problem I faced is with loading a data file which has no delimiter!!! Meaning the data file that I wanted to load is a Fixed width format. There is no way in PostgreSQL to load a file without the delimiter. It was easy in Oracle SQLLoader :) but not with PostgreSQL.

There is one way you can load the data by following this link. It uses a very simple technique of:
1] Create a temp table
2] Load the data as is in a single column
3] And then insert into (CTAS) a table by parsing that single column in a select statement to get the fixed width data out by using substr.

Very easy if you have a few number of records to be loaded.

In my case, the data file that I wanted to load has about 250 million records and the above solution is a big NO.

So, this is how I approached to load the data. Very simple :)

The main problem/issue is with Fixed width. So, what if, if I can convert that fixed width into a delimited file??? problem solved. I can use the regular COPY command to load that monster file !!!

First, you have to create the table with all the columns. Since it is fixed width, you will have a dictionary specifying the column and either the start and end position or the length for that column. Use that dictionary and create a table.

Once the table is created, get the length of each column in a single row. I used vim to bring all the rows into a single row (vim command :%j)

After you have all the column widths in a single row use the following gawk command to do the magic of converting converting the fixed width file into a delimited format.

$ gawk '$1=$1' FIELDWIDTHS='10 10 10 5 4 2 3 7 3 10 2 28 4 2 6 8 28 13 2 25 30 30 3 2 4 9 10 1 1 1 1 6 15 1 20 6 1 6 3 2 2 2 2 2 2 1 1 1 1 1 2 2 1 1 8 2 1 1 1 1 1 2 2 3 2 3 3 14 3 1 2 1 1 1 1 2 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 17 3 1 1 1 1 1 7 1 4 4 4 2 3 3 1 2 1 1 1 1 1 1 4 1 1 4 1 3 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 8 4 4 4 4 4 1 1 1 1 8 3 4 3 8 3 4 3 8 3 4 3 8 3 4 3 8 3 4 3 8 3 4 3 8 3 4 3 8 3 4 3 8 3 4 3 8 3 4 3 3 4 3 4 3 4 3 4 3 4 3 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1' OFS='|' fixed_with_data_file.txt > delimited_data_file.txt

That's it. Now you have a file with the delimiter and you can use COPY command to load into your table.

Oh BTW, I used OFS='|' in the command which indicates my preferred delimiter is a pipe. If you want the regular csv then use OFS=, (no need of single quotes around , )


Loading data from datafile into PostgreSQL

Loading data from a data file into PostgreSQL is easy with the command "copy".

Steps involved in loading data:
1] Create a table with the columns in the data file
2] Use COPY to load data.

That is simple. But the challenge arises when the data file has hundreds of columns and you want the table to be created automatically with those many columns presented in the data file.

I found this little function and modified a bit to accommodate any kind of delimiter.

Here is the simple trick to find out the number of columns in the data file (unix):

$> head -1 data_file.txt |grep -o "|" |wc -l

grep -o parameter value is the delimiter used in the data file. In the example above it is pipe (|) delimited.

Here is how to use the function in PSQL:

select load_data_file('actual_table_name', '/data1/schanna/analytics/data_file_456.txt', 188, '|');

Here is the function Code:

CREATE OR REPLACE FUNCTION schanna.load_data_file(target_table text, file_path text, col_count integer, delimiter_char text)
 RETURNS void
 LANGUAGE plpgsql
AS $function$

declare
--Parameters:
--target_table - Name of the table to be
--file_path - Path of the data file to be loaded
--col_count - Number of columns in the data file
--delimiter_char - Delimiter Character used in the data file

iter integer; -- dummy integer to iterate columns with
col text; -- variable to keep the column name at each iteration
col_first text; -- first column name, e.g., top left corner on a data file

begin
    --Set the working schema
    set schema 'schanna';

    -- Create a temporary table to load the data. This will be renamed at the end with the target_table parameter value   
    create table temp_table ();

    -- Loop thru the number of columns in the data file and add columns to the temporary table
    for iter in 1..col_count
    loop
        execute format('alter table temp_table add column col_%s text;', iter);
    end loop;

    -- Load data from data file
    if delimiter_char = ',' then
      execute format('copy temp_table from %L with delimiter '','' quote ''"'' csv ', file_path);
    elsif delimiter_char = '|' then
      execute format('copy temp_table from %L with delimiter ''|'' ', file_path);
    end if;

    iter := 1;
    col_first := (select col_1 from temp_table limit 1);

    -- update the column names based on the first row which has the column names
    for col in execute format('select unnest(string_to_array(trim(temp_table::text, ''()''), '','')) from temp_table where col_1 = %L', col_first)
    loop
        execute format('alter table temp_table rename column col_%s to %s', iter, col);
        iter := iter + 1;
    end loop;

    -- delete the First row which contains the column header
    execute format('delete from temp_table where %s = %L', col_first, col_first);

    -- rename temp table to the name given as parameter, if not blank
    if length(target_table) > 0 then
        execute format('alter table temp_table rename to %I', target_table);
    end if;

end;

$function$