Apache Avro schema validation in Apache Flume

Question

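Having read about Apache Flume and the benefits it offers for handling client events, I decided it was time to look at it in more detail. Another benefit appears to be that it can handle Apache Avro objects :-) However, I'm struggling to understand how an Avro schema is used to validate the Flume events that are received.

To explain my problem in more detail, I have provided code snippets below.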

Avro schema

For the purposes of this post, I am using a sample schema that defines a nested Object1 record with 2 fields.

{
  "namespace": "com.example.avro",
  "name": "Example",
  "type": "record",
  "fields": [
    {
      "name": "object1",
      "type": {
        "name": "Object1",
        "type": "record",
        "fields": [
          {
            "name": "value1",
            "type": "string"
          },
          {
            "name": "value2",
            "type": "string"
          }
        ]
      }
    }
  ]
}

Embedded Flume agent

In my Java project I am currently using the Apache Flume embedded agent, as shown below.

public static void main(String[] args) {
    final Event event = EventBuilder.withBody("Test", Charset.forName("UTF-8"));

    final Map<String, String> properties = new HashMap<>();
    properties.put("channel.type", "memory");
    properties.put("channel.capacity", "100");
    properties.put("sinks", "sink1");
    properties.put("sink1.type", "avro");
    properties.put("sink1.hostname", "192.168.99.101");
    properties.put("sink1.port", "11111");
    properties.put("sink1.batch-size", "1");
    properties.put("processor.type", "failover");

    final EmbeddedAgent embeddedAgent = new EmbeddedAgent("TestAgent");
    embeddedAgent.configure(properties);
    embeddedAgent.start();

    try {
        embeddedAgent.put(event);
    } catch (EventDeliveryException e) {
        e.printStackTrace();
    }
}

In the example above, I create a new Flume event with "Test" as the event body and send it to a separate Apache Flume agent running inside a VM (192.168.99.101).

Remote Flume agent

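As described above, I have configured this agent to receive events from the embedded Flume agent. The Flume configuration for this agent looks like this: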

# Name the components on this agent
hello.sources = avroSource
hello.channels = memoryChannel
hello.sinks = loggerSink

# Describe/configure the source
hello.sources.avroSource.type = avro
hello.sources.avroSource.bind = 0.0.0.0
hello.sources.avroSource.port = 11111
hello.sources.avroSource.channels = memoryChannel

# Describe the sink
hello.sinks.loggerSink.type = logger

# Use a channel which buffers events in memory
hello.channels.memoryChannel.type = memory
hello.channels.memoryChannel.capacity = 1000
hello.channels.memoryChannel.transactionCapacity = 1000

# Bind the source and sink to the channel
hello.sources.avroSource.channels = memoryChannel
hello.sinks.loggerSink.channel = memoryChannel

I am executing the following command to start the agent:

./bin/flume-ng agent --conf conf --conf-file ../sample-flume.conf --name hello -Dflume.root.logger=TRACE,console -Dorg.apache.flume.log.printconfig=true -Dorg.apache.flume.log.rawdata=true

When I run my Java project's main method, I see the "Test" event delivered to my logger sink with the following output:

2019-02-18 14:15:09,998 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 54 65 73 74                                     Test }

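However, it's not clear to me where I should configure the Avro schema to ensure that Flume only receives and processes valid events. Can someone help me understand where I am going wrong? Or have I misunderstood how Flume is meant to convert a Flume event into an Avro event?

In addition to the above, I have also tried using the Avro RPC client, after changing the Avro schema to specify a protocol, to talk directly to my remote Flume agent, but when I try to send an event I see the following error: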

Exception in thread "main" org.apache.avro.AvroRuntimeException: Not a remote message: test
    at org.apache.avro.ipc.Requestor$Response.getResponse(Requestor.java:532)
    at org.apache.avro.ipc.Requestor$TransceiverCallback.handleResult(Requestor.java:359)
    at org.apache.avro.ipc.Requestor$TransceiverCallback.handleResult(Requestor.java:322)
    at org.apache.avro.ipc.NettyTransceiver$NettyClientAvroHandler.messageReceived(NettyTransceiver.java:613)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.apache.avro.ipc.NettyTransceiver$NettyClientAvroHandler.handleUpstream(NettyTransceiver.java:595)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:558)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:786)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:458)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:439)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:558)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:553)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:84)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.processSelectedKeys(AbstractNioWorker.java:471)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:332)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:35)
    at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:102)
    at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

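My goal is to ensure that the events populated by my application conform to the generated Avro schema, so that no invalid events are published. I would prefer to achieve this with the embedded Flume agent, but if that is not possible I would consider using the Avro RPC approach to talk directly to my remote Flume agent.

Any help or guidance would be greatly appreciated. Thanks in advance.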

UPDATE

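After further reading, I wonder whether I have misunderstood the purpose of Apache Flume. I originally thought it could be used to create Avro events automatically from the data and a schema, but now I wonder whether the application should be responsible for producing the Avro events, which are then stored in Flume according to the channel configuration and sent in batches through the sink (in my case, to a Spark Streaming cluster).

If the above is correct, does Flume need to know about the schema at all, or only my Spark Streaming cluster, which will eventually process the data? If Flume does need to know about the schema, could you please explain how this can be achieved?

Thanks in advance.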

Answer 1

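Since your goal is to process the data with a Spark Streaming cluster, you can solve this problem in 2 ways.

1) Use a Flume client (tested with flume-ng-sdk 1.9.0) and Spark Streaming (tested with spark-streaming_2.11 2.4.0 and spark-streaming-flume_2.11 2.3.0), with no Flume server in the network topology between them.

The client class sends a Flume JSON event to port 41416: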

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class JSONFlumeClient {
    public static void main(String[] args) {
        // Connect to the Avro endpoint listening on port 41416
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41416);
        // The event body is plain JSON text (here, the schema definition from the question)
        String jsonData = "{\r\n" + "  \"namespace\": \"com.example.avro\",\r\n" + "  \"name\": \"Example\",\r\n"
                + "  \"type\": \"record\",\r\n" + "  \"fields\": [\r\n" + "    {\r\n"
                + "      \"name\": \"object1\",\r\n" + "      \"type\": {\r\n" + "        \"name\": \"Object1\",\r\n"
                + "        \"type\": \"record\",\r\n" + "        \"fields\": [\r\n" + "          {\r\n"
                + "            \"name\": \"value1\",\r\n" + "            \"type\": \"string\"\r\n" + "          },\r\n"
                + "          {\r\n" + "            \"name\": \"value2\",\r\n" + "            \"type\": \"string\"\r\n"
                + "          }\r\n" + "        ]\r\n" + "      }\r\n" + "    }\r\n" + "  ]\r\n" + "}";
        Event event = EventBuilder.withBody(jsonData, StandardCharsets.UTF_8);
        try {
            client.append(event);
        } catch (Throwable t) {
            System.err.println(t.getMessage());
            t.printStackTrace();
        } finally {
            // Always release the RPC connection
            client.close();
        }
    }
}
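Note that the client above sends the schema definition itself as the event body. If the intent is to send a record that matches the schema, the body would carry field values instead. The following is only a minimal sketch using the JDK alone; `ExampleRecordBody`, `quote` and `toJson` are hypothetical names that are not part of Flume or Avro, and the sample values "foo" and "bar" are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Flume or Avro): builds the JSON text of one
// record instance shaped like the question's schema, for use as an event body.
class ExampleRecordBody {

    // Escapes backslashes and double quotes so the value stays a valid JSON string.
    // Control characters are not handled; this is a sketch, not a JSON library.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Produces {"object1":{"value1":...,"value2":...}} for the two string fields.
    static String toJson(String value1, String value2) {
        return "{\"object1\":{\"value1\":" + quote(value1)
                + ",\"value2\":" + quote(value2) + "}}";
    }

    public static void main(String[] args) {
        String json = toJson("foo", "bar");
        // These bytes are what would be passed as the Flume event body
        byte[] body = json.getBytes(StandardCharsets.UTF_8);
        System.out.println(json);
        System.out.println(body.length + " bytes");
    }
}
```

The resulting string could be passed to `EventBuilder.withBody` in place of `jsonData`. Nothing in this sketch checks the values against the schema; it only shows the shape of a matching record.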

The Spark Streaming server class listens on port 41416:

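import java.nio.charset.StandardCharsets;
import java.util.Date;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

public class SparkStreamingToySample {
    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("SparkStreamingToySample");
        // 30-second batch interval
        JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(30));
        // Receive Flume events on localhost:41416
        JavaReceiverInputDStream<SparkFlumeEvent> lines =
                FlumeUtils.createStream(ssc, "localhost", 41416);
        // Decode each event body as UTF-8 and print every batch with its timestamp
        lines.map(sfe -> new String(sfe.event().getBody().array(), StandardCharsets.UTF_8))
                .foreachRDD((data, time) ->
                        System.out.println("***" + new Date(time.milliseconds()) + "=" + data.collect().toString()));
        ssc.start();
        ssc.awaitTermination();
    }
}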

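2) Use a Flume client + Flume server (as the network topology in between) + Spark Streaming (as the Flume sink).

For this option the code is the same, but if you are running this test locally, Spark Streaming must now specify the fully qualified DNS hostname instead of localhost in order to start the Spark Streaming server on the same port 41416. The Flume client will connect to the Flume server on port 41415. The tricky part is how to define the Flume topology: you need to specify both a source and a sink for it to work.

See the Flume configuration below: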

agent1.channels.ch1.type = memory

agent1.sources.avroSource1.channels = ch1
agent1.sources.avroSource1.type = avro
agent1.sources.avroSource1.bind = 0.0.0.0
agent1.sources.avroSource1.port = 41415

agent1.sinks.avroSink.channel = ch1
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.hostname = <full dns qualified hostname>
agent1.sinks.avroSink.port = 41416

agent1.channels = ch1
agent1.sources = avroSource1
agent1.sinks = avroSink

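You should get the same result with either solution. Coming back to your question of whether Flume is really needed for Spark Streaming over a JSON stream, the answer is that it depends. Flume supports interceptors, so in this case it can be used to clean or filter out data that is invalid for your Spark project. However, because you are adding an extra component to the topology, it may affect performance and require more resources (CPU/memory) than running without Flume.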
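As a rough illustration of the interceptor option, an interceptor is attached to a source in the agent configuration. The fragment below extends the `agent1` configuration above. The interceptor name `validator` and the class `com.example.flume.SchemaValidatingInterceptor` are hypothetical: you would have to write that class yourself (implementing Flume's `Interceptor` interface and dropping events whose body fails validation) and place it on the agent's classpath.

```
# Attach one interceptor, named "validator", to the Avro source
agent1.sources.avroSource1.interceptors = validator
# Hypothetical custom class; Flume is given the name of its nested Builder class
agent1.sources.avroSource1.interceptors.validator.type = com.example.flume.SchemaValidatingInterceptor$Builder
```

With this in place, events would be checked as they enter the source, before they reach the channel and the Avro sink that forwards them to Spark Streaming.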
