How to check and compare file names inside a folder (Datalake) using ADF

Question

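My requirement is to compare the file names in a Datalake folder with the file names listed in a .CSV file. If a file name matches, I want to copy that file; if it does not match, I want to store that file name in a .CSV file in the data lake.

Please help.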

Answer 1

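You can meet this requirement in 3 steps: get the file names from the csv file and from the ADLS folder, filter the folder's file names into matching and non-matching sets, and then perform the corresponding copy operations.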

Step 1:

I used a `Get Metadata` activity to get the list of file names from the ADLS folder (sample1.csv, sample2.csv, sample3.csv, sample4.csv). Create a dataset pointing to your folder and use `Child items` as the field list.

And I used a `Lookup` activity to get the file names from the csv file (sample1.csv, sample2.csv, sample5.csv, sample6.csv).
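
To make the later expressions easier to follow, here is a rough sketch of what the two activity outputs look like, written as Python dictionaries. The column header `filename` is a hypothetical name (the question does not say what the csv header is), and the Lookup shape assumes "First row only" is unchecked, which is what the `.output.value` expressions in this answer rely on.

```python
# Approximate output of the Get Metadata activity with "Child items" selected.
get_metadata_output = {
    "childItems": [
        {"name": "sample1.csv", "type": "File"},
        {"name": "sample2.csv", "type": "File"},
        {"name": "sample3.csv", "type": "File"},
        {"name": "sample4.csv", "type": "File"},
    ]
}

# Approximate output of the Lookup activity when "First row only" is unchecked.
# "filename" is a hypothetical column header; use whatever header your csv has.
lookup_output = {
    "count": 4,
    "value": [
        {"filename": "sample1.csv"},
        {"filename": "sample2.csv"},
        {"filename": "sample5.csv"},
        {"filename": "sample6.csv"},
    ],
}

# The names each activity contributes to the comparison in the next step.
folder_names = [item["name"] for item in get_metadata_output["childItems"]]
csv_names = [row["filename"] for row in lookup_output["value"]]

print(folder_names)
print(csv_names)
```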

Step 2:

Now use a Filter activity to get the matching file names. I used the following items and condition values to get the matching file names:

items- @activity('list of files in folder').output.childItems
condition- @contains(string(activity('filenames present in csv').output.value),item().name)

To get the non-matching file names from the ADLS folder, I used the following items and condition:

items- @activity('list of files in folder').output.childItems
condition- @not(contains(string(activity('filenames present in csv').output.value),item().name))
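
The two filter conditions can be simulated in plain Python to see what they select. This is only a sketch: `json.dumps` stands in for ADF's `string()` (the exact text ADF produces may differ slightly), and `filename` is a hypothetical csv column header. The point it illustrates is that `contains` on a string is a substring test.

```python
import json

# Sample data from the answer: folder contents and csv rows.
child_items = [{"name": f"sample{i}.csv", "type": "File"} for i in (1, 2, 3, 4)]
lookup_value = [{"filename": f"sample{i}.csv"} for i in (1, 2, 5, 6)]

# string(activity(...).output.value): the whole Lookup result as one string.
lookup_as_text = json.dumps(lookup_value)

# contains(<string>, item().name) behaves like a substring check.
matched = [i["name"] for i in child_items if i["name"] in lookup_as_text]
unmatched = [i["name"] for i in child_items if i["name"] not in lookup_as_text]

print(matched)    # ['sample1.csv', 'sample2.csv']
print(unmatched)  # ['sample3.csv', 'sample4.csv']

# Caveat: because this is a substring test, a folder file named "ample1.csv"
# would also be treated as a match for the csv entry "sample1.csv".
false_positive = "ample1.csv" in lookup_as_text

# A stricter variant (not part of the original answer): compare exact names.
csv_names = {row["filename"] for row in lookup_value}
exact_matched = [i["name"] for i in child_items if i["name"] in csv_names]
```

For file names like the ones in this example the substring check is sufficient; the exact comparison only matters if one file name can be contained inside another.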

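Step 3:

Now use a ForEach activity to copy each file to another location. For the first ForEach, I set the items value to `@activity('getting matching files').output.Value`. Inside it, I configured a Copy activity to copy the current ForEach item (i.e., the file name).
I created a parameter named `filename` in the source dataset and passed its value (`@item().name`) from the Copy data source settings.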
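
As a rough local analogy of this ForEach plus Copy pattern, the sketch below copies one file per loop iteration, with the loop variable playing the role of the `filename` dataset parameter. The function name `copy_matching` and the `input`/`output` folder names are hypothetical, and a temporary directory stands in for the data lake.

```python
import shutil
import tempfile
from pathlib import Path


def copy_matching(source_dir: Path, target_dir: Path, matched_items: list) -> list:
    """Copy each matched file, one per iteration, like ForEach + Copy."""
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for item in matched_items:       # ForEach over the filter output
        filename = item["name"]      # @item().name -> dataset parameter
        shutil.copy(source_dir / filename, target_dir / filename)
        copied.append(filename)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "input"
    target = Path(tmp) / "output"
    source.mkdir()
    # Recreate the example folder with four small files.
    for i in (1, 2, 3, 4):
        (source / f"sample{i}.csv").write_text("id,value\n1,a\n")

    # Shape of the "getting matching files" filter output.
    matched_items = [
        {"name": "sample1.csv", "type": "File"},
        {"name": "sample2.csv", "type": "File"},
    ]
    copied = copy_matching(source, target, matched_items)
    in_target = sorted(p.name for p in target.iterdir())
    print(in_target)
```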

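Now, for the non-matching file names in the folder, I used a ForEach with an Append Variable activity to build an array of file names such as `["sample3.csv", "sample4.csv"]`. The items value of this ForEach is `@activity('getting unmatched files').output.Value`.
Inside the ForEach, I used an `Append variable` activity with the value `@item().name`.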
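
In Python terms, this ForEach plus Append Variable combination amounts to the loop below. It is a minimal sketch: the list `filenames` plays the role of an Array-type pipeline variable, and the input mirrors the shape of the filter output.

```python
# Shape of the "getting unmatched files" filter output.
unmatched_items = [
    {"name": "sample3.csv", "type": "File"},
    {"name": "sample4.csv", "type": "File"},
]

filenames = []  # stands in for an Array-type pipeline variable
for item in unmatched_items:         # ForEach over the filter output
    filenames.append(item["name"])   # Append variable with @item().name

print(filenames)  # ['sample3.csv', 'sample4.csv']
```

Note that ForEach runs its iterations in parallel unless the Sequential option is enabled, so the order of the appended names may differ from the folder order if that matters for your output file.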

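Now we have to create a new csv file containing all the non-matching file names from the folder. Use a Copy data activity and take any source file with some content (the content does not matter; we only need a file to act as the source).
Then add an additional column for the file names (for example `filenames`), with the dynamic content value below. (Make sure the value in the pipeline JSON is the same as the expression shown here.)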

@join(variables('filenames'),'
')

#the values will be joined using newline(\n). 
#Using \n directly in dynamic content would not work as it will be taken as \\n. 
#So change it in pipeline json as in above reference image.
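
The difference between a real newline and an escaped `\\n` can be seen in a short Python sketch. It only illustrates the string behaviour described above; it does not reproduce how ADF stores the expression.

```python
filenames = ["sample3.csv", "sample4.csv"]

# Joining with a real newline character: one file name per line.
with_newline = "\n".join(filenames)

# Joining with an escaped sequence (backslash followed by "n"):
# everything stays on a single line with a literal \n in the middle.
with_escaped = "\\n".join(filenames)

print(with_newline.splitlines())   # ['sample3.csv', 'sample4.csv']
print(with_escaped.splitlines())   # ['sample3.csv\\nsample4.csv']
```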
