Reading and transforming FHIR data with the ADF REST connector

Question

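I'm trying to use Azure Data Factory to read data from a FHIR server and transform the results into newline-delimited JSON (ndjson) files in Azure Blob storage. Specifically, if you query a FHIR server, you might get something like this: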

{
    "resourceType": "Bundle",
    "id": "som-id",
    "type": "searchset",
    "link": [
        {
            "relation": "next",
            "url": "https://fhirserver/?ct=token"
        },
        {
            "relation": "self",
            "url": "https://fhirserver/"
        }
    ],
    "entry": [
        {
            "fullUrl": "https://fhirserver/Organization/1234",
            "resource": {
                "resourceType": "Organization",
                "id": "1234",
                // More fields
            }
        },
        {
            "fullUrl": "https://fhirserver/Organization/456",
            "resource": {
                "resourceType": "Organization",
                "id": "456",
                // More fields
            }
        },

        // More resources
    ]
}

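Basically a bundle of resources. I'd like to transform that into a newline-delimited (aka ndjson) file, where each line is just the JSON for one resource: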

{"resourceType": "Organization", "id": "1234", // More fields }
{"resourceType": "Organization", "id": "456", // More fields }
// More lines with resources
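For reference, the target transformation itself is small when written outside ADF. The following is a minimal Python sketch, not an ADF feature; the function name `bundle_to_ndjson` and the sample bundle are made up for illustration, and the only assumption is that resources live under `entry[].resource` as in the Bundle shown above.

```python
import json


def bundle_to_ndjson(bundle: dict) -> str:
    """Return one compact JSON document per line, one line per entry resource."""
    lines = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource")
        if resource is not None:
            # Compact separators keep each resource on a single line.
            lines.append(json.dumps(resource, separators=(",", ":")))
    # ndjson files conventionally end each record with a newline.
    return "\n".join(lines) + ("\n" if lines else "")


# Illustrative bundle with the same shape as the search result above.
bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {
            "fullUrl": "https://fhirserver/Organization/1234",
            "resource": {"resourceType": "Organization", "id": "1234"},
        },
        {
            "fullUrl": "https://fhirserver/Organization/456",
            "resource": {"resourceType": "Organization", "id": "456"},
        },
    ],
}

print(bundle_to_ndjson(bundle), end="")
```

The Bundle envelope (`link`, `type`, `fullUrl`) is dropped on purpose; only the `resource` objects end up in the output.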

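I was able to set up the REST connector, and it can query the FHIR server (including pagination), but no matter what I try I can't seem to generate the output I want. I set up an Azure Blob storage dataset: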

{
    "name": "AzureBlob1",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage1",
            "type": "LinkedServiceReference"
        },
        "type": "AzureBlob",
        "typeProperties": {
            "format": {
                "type": "JsonFormat",
                "filePattern": "setOfObjects"
            },
            "fileName": "myout.json",
            "folderPath": "outfhirfromadf"
        }
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}

And configured a copy activity:

{
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "Copy Data1",
                "type": "Copy",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "typeProperties": {
                    "source": {
                        "type": "RestSource",
                        "httpRequestTimeout": "00:01:40",
                        "requestInterval": "00.00:00:00.010"
                    },
                    "sink": {
                        "type": "BlobSink"
                    },
                    "enableStaging": false,
                    "translator": {
                        "type": "TabularTranslator",
                        "schemaMapping": {
                            "resource": "resource"
                        },
                        "collectionReference": "$.entry"
                    }
                },
                "inputs": [
                    {
                        "referenceName": "FHIRSource",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "AzureBlob1",
                        "type": "DatasetReference"
                    }
                ]
            }
        ]
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}

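But in the end (despite the schema mapping being configured), the result in the blob is always just the original bundle returned from the server. If I configure the output blob as comma-delimited text, I can extract fields and create a flattened tabular view, but that's not what I want.

Any suggestions would be much appreciated.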

Answer 1

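As briefly discussed in the comments, the Copy Activity does not provide much functionality beyond mapping data. As stated in the documentation, the Copy Activity performs the following operations:

Reads data from the source data store.

Performs serialization/deserialization, compression/decompression, column mapping, and so on. It does these operations based on the configuration of the input dataset, the output dataset, and the Copy Activity.

Writes data to the sink/destination data store.

It doesn't look like the Copy Activity does anything other than efficiently copy things around.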

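What I found to work was using Databricks.

Here are the steps:

Add a Databricks account to your subscription;
Go to the Databricks page by clicking the authoring button;
Create a notebook;
Write the script (Scala, Python, or .NET, which was recently announced).

The script would do the following:

Read the data from Blob storage;
Filter out and transform the data as needed;
Write the data back to Blob storage.

You can test your script from there and, once it is ready, go back to your pipeline and create a Notebook activity that points to the notebook containing the script.

I struggled coding in Scala, but it was worth it :)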
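The read, filter/transform, and write steps of such a script might look roughly like the sketch below. This is not the Scala script referred to above; it is plain Python that uses ordinary file paths as a stand-in for Blob storage (in a Databricks notebook the paths would typically point at mounted storage), and the function name `bundles_to_ndjson` and the `resource_type` filter are hypothetical.

```python
import json
from pathlib import Path


def bundles_to_ndjson(input_dir, output_file, resource_type=None):
    """Read every *.json Bundle in input_dir and write its resources as ndjson.

    If resource_type is given, only resources of that type are kept.
    Returns the number of resources written.
    """
    count = 0
    with open(output_file, "w", encoding="utf-8") as out:
        # Sorted so the output order is stable across runs.
        for path in sorted(Path(input_dir).glob("*.json")):
            with open(path, encoding="utf-8") as f:
                bundle = json.load(f)
            for entry in bundle.get("entry", []):
                resource = entry.get("resource")
                if resource is None:
                    continue
                if resource_type and resource.get("resourceType") != resource_type:
                    continue
                # One resource per line.
                out.write(json.dumps(resource) + "\n")
                count += 1
    return count
```

A call such as `bundles_to_ndjson("bundles", "organizations.ndjson", resource_type="Organization")` would cover all three steps in one pass. The output file should not be a `.json` file inside `input_dir`, or it would be picked up by the glob as an input.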

Answer 2

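So I found a solution. If I do the original step of simply dumping the bundles into a JSON file, and then do another conversion from that JSON file to what I pretend is a text file in another blob, I can get the ndjson file created.

Basically, define another blob dataset: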

{
    "name": "AzureBlob2",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage1",
            "type": "LinkedServiceReference"
        },
        "type": "AzureBlob",
        "structure": [
            {
                "name": "Prop_0",
                "type": "String"
            }
        ],
        "typeProperties": {
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "rowDelimiter": "",
                "quoteChar": "",
                "nullValue": "\\N",
                "encodingName": null,
                "treatEmptyAsNull": true,
                "skipLineCount": 0,
                "firstRowAsHeader": false
            },
            "fileName": "myout.json",
            "folderPath": "adfjsonout2"
        }
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}

Note that this one uses TextFormat, and also note that quoteChar is blank. If I then add another copy activity:

{
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "Copy Data1",
                "type": "Copy",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "typeProperties": {
                    "source": {
                        "type": "RestSource",
                        "httpRequestTimeout": "00:01:40",
                        "requestInterval": "00.00:00:00.010"
                    },
                    "sink": {
                        "type": "BlobSink"
                    },
                    "enableStaging": false,
                    "translator": {
                        "type": "TabularTranslator",
                        "schemaMapping": {
                            "['resource']": "resource"
                        },
                        "collectionReference": "$.entry"
                    }
                },
                "inputs": [
                    {
                        "referenceName": "FHIRSource",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "AzureBlob1",
                        "type": "DatasetReference"
                    }
                ]
            },
            {
                "name": "Copy Data2",
                "type": "Copy",
                "dependsOn": [
                    {
                        "activity": "Copy Data1",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "typeProperties": {
                    "source": {
                        "type": "BlobSource",
                        "recursive": true
                    },
                    "sink": {
                        "type": "BlobSink"
                    },
                    "enableStaging": false,
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": {
                            "resource": "Prop_0"
                        }
                    }
                },
                "inputs": [
                    {
                        "referenceName": "AzureBlob1",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "AzureBlob2",
                        "type": "DatasetReference"
                    }
                ]
            }
        ]
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}

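Then it all works out. I don't think this is ideal, since I now have two copies of the data in blobs, but I can easily delete one.

If anyone has a one-step solution, I would still love to hear it.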
