Monday, November 14, 2011

Capturing the most recent file in Talend Open Studio

All modern RDBMSs let us limit the number of rows a query returns, based on a chosen sort criterion. For instance, in PostgreSQL we can use the LIMIT clause (widely supported, although not part of the ANSI standard):


SELECT * FROM sample_table ORDER BY ts DESC LIMIT 1

in order to obtain only the most recent row (according to the timestamp "ts").

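Similarly, in Oracle we would write something like:

SELECT * FROM (SELECT * FROM sample_table ORDER BY ts DESC) WHERE ROWNUM = 1

The subquery is required: Oracle assigns ROWNUM before it applies ORDER BY, so the sort has to happen in an inner query and the row limit in the outer one.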

Is it possible to implement the same "sampling" in a Talend Open Studio data flow, using the standard components provided by the IDE? Obviously, yes.

Let's see how.

In the following example job we begin with a tFileList component that scans the content of a given directory by last modified date, in descending order (i.e. from the most recent file to the oldest), and provides an iterate link as output. Other combinations are of course possible (by file name in ascending order, by file size, etc.).
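Outside Talend, the same ordering can be sketched in plain Java. This is only an illustration of what the tFileList step produces, not the code Talend generates; the class and method names are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the tFileList step; not Talend-generated code.
class NewestFirstListing {

    /** Returns the regular files in dir, most recently modified first. */
    static List<File> listNewestFirst(File dir) {
        File[] entries = dir.listFiles(File::isFile);
        if (entries == null) {
            // dir does not exist, is not a directory, or could not be read
            return new ArrayList<>();
        }
        // Descending by last-modified timestamp (milliseconds since the epoch)
        Arrays.sort(entries, (a, b) -> Long.compare(b.lastModified(), a.lastModified()));
        return new ArrayList<>(Arrays.asList(entries));
    }

    public static void main(String[] args) {
        // Directory to scan: first argument, or the current directory
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File f : listNewestFirst(dir)) {
            System.out.println(f.lastModified() + "  " + f.getName());
        }
    }
}
```

Swapping the two arguments of Long.compare would give the ascending order instead.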



In the tFileList component it is also possible to specify a filemask (in Java regular expression notation) in order to restrict the list of files.
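To make the regular-expression filemask concrete, the sketch below filters a list of file names in plain Java with java.util.regex. The mask and the file names are made-up examples and the helper name is hypothetical; that Talend matches the whole file name in exactly this way is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper illustrating a regex filemask; not Talend-generated code.
class FilemaskDemo {

    /** Keeps only the names that match the whole regular expression. */
    static List<String> applyFilemask(List<String> fileNames, String regex) {
        Pattern mask = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String name : fileNames) {
            if (mask.matcher(name).matches()) { // full match, not a substring search
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("orders.csv", "orders.csv.bak", "readme.txt");
        // The dot is escaped so that it matches a literal "." only
        System.out.println(applyFilemask(names, ".*\\.csv")); // prints [orders.csv]
    }
}
```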


Note: tFileList runs directly over the filesystem and therefore offers a very restricted set of ordering options. If we need a custom sort, we should complement tFileList with a tSortRow component or a tJavaFlex with custom code.
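As an idea of what such custom code might look like, here is a minimal plain-Java sketch that orders file names by a date stamp embedded in the name rather than by a filesystem attribute. The naming scheme (an 8-digit yyyyMMdd stamp such as export_20111114.csv) and all identifiers are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical custom ordering; the file naming scheme is an assumption.
class CustomSortDemo {

    private static final Pattern STAMP = Pattern.compile("(\\d{8})");

    /** Extracts the first 8-digit stamp (yyyyMMdd), or "" when there is none. */
    static String stampOf(String fileName) {
        Matcher m = STAMP.matcher(fileName);
        return m.find() ? m.group(1) : "";
    }

    /** Returns a copy sorted by embedded stamp, newest first; unstamped names go last. */
    static List<String> sortByStampDesc(List<String> fileNames) {
        List<String> sorted = new ArrayList<>(fileNames);
        // yyyyMMdd strings compare chronologically when compared as text
        sorted.sort((a, b) -> stampOf(b).compareTo(stampOf(a)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "export_20111101.csv", "notes.txt", "export_20111114.csv");
        System.out.println(sortByStampDesc(names));
    }
}
```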


tFlowToIterate.

We must restrict the list of files to only the most recent one. To do so, we use a tIterateToFlow component to convert the iterate link into a data flow. We use "FileName" and "FilePath" as column names for the file name and the complete filesystem path, respectively.

tIterateToFlow.

We can finally "cut" our file list using the tSampleRow component, specifying "1" as the Range number. Since the file list (now a data flow link) is ordered by descending file timestamp, the only remaining row is guaranteed to correspond to the most recent file.
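In plain Java terms, keeping range "1" of an already sorted flow amounts to taking the first element of a list. The sketch below assumes the rows are file names already ordered newest first; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical equivalent of sampling row 1 from a sorted flow.
class FirstRowDemo {

    /** Returns the first row, or empty if the flow carried no rows at all. */
    static Optional<String> firstRow(List<String> sortedRows) {
        return sortedRows.isEmpty() ? Optional.empty() : Optional.of(sortedRows.get(0));
    }

    public static void main(String[] args) {
        // Already ordered by descending modification time (made-up names)
        List<String> newestFirst = Arrays.asList("report_3.csv", "report_2.csv", "report_1.csv");
        System.out.println(firstRow(newestFirst).orElse("(no files)")); // prints report_3.csv
    }
}
```

The empty case is worth handling explicitly: a directory with no matching files yields no row at all.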

tSampleRow

Converting the data flow link back into an iterate link, we can use the file name and file path variables in other components. For instance, we could pass this "most recent file name" to the "exclude filemask" option of a subsequent tFileList component, in case we want the list of all files in the same directory except the most recent one.

tFlowToIterate.

Using a simple tJava, we print the selected file name to the standard output. Similarly, we could have used the file name global variable in another component, for instance tFileDelete.
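Inside the tJava, the value is read from Talend's globalMap. The sketch below imitates that lookup with an ordinary java.util.Map so that it runs outside Talend. The key "row2.FileName" assumes that the flow feeding tFlowToIterate is named row2 and that the component's default key naming (row name, dot, column name) is in use, so adjust it to your own job.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a tJava body; in a real job Talend supplies globalMap itself.
class PrintSelectedFile {

    /** Builds the line printed for the selected file. */
    static String describe(Map<String, Object> globalMap) {
        // Assumed keys: "<row name>.<column name>" with a row called row2
        String fileName = (String) globalMap.get("row2.FileName");
        String filePath = (String) globalMap.get("row2.FilePath");
        return "Most recent file: " + fileName + " (" + filePath + ")";
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("row2.FileName", "report_3.csv");          // made-up value
        globalMap.put("row2.FilePath", "/data/in/report_3.csv"); // made-up value
        System.out.println(describe(globalMap));
    }
}
```

In the actual tJava only the lookup and the println would be needed, since the map is already populated by the preceding component.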

The tFileList component ("Advanced Settings" tab) lets us define an "exclude" filemask, which specifies the files to leave out of the tFileList output iterate link. By specifying a single file name, we restrict the exclusion to that file alone: in our example, the most recent one.
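The effect of excluding a single file name can be sketched in plain Java as follows. The code assumes the list is already sorted newest first and simply drops the entry whose name equals the selected one; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of an exclude mask that names exactly one file.
class ExcludeNewestDemo {

    /** Returns every name except the excluded one, preserving the original order. */
    static List<String> exclude(List<String> fileNames, String excludedName) {
        List<String> remaining = new ArrayList<>();
        for (String name : fileNames) {
            if (!name.equals(excludedName)) {
                remaining.add(name);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        List<String> newestFirst = Arrays.asList("report_3.csv", "report_2.csv", "report_1.csv");
        String mostRecent = newestFirst.get(0); // the name selected earlier in the job
        System.out.println(exclude(newestFirst, mostRecent)); // prints [report_2.csv, report_1.csv]
    }
}
```

If the exclude mask is interpreted as a regular expression, a literal file name should be escaped first (java.util.regex.Pattern.quote does this), since a character such as "." would otherwise match any character.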
