
I am using Kafka and ZooKeeper as the main components of my data pipeline, which processes thousands of requests each second. I am using Samza as the real-time data processing tool for the small transformations I need to make on the data.

My problem is that one of my consumers (let's say ConsumerA) consumes several topics from Kafka and processes them, basically creating a summary of the topics it digests. I further want to push this data to Kafka as a separate topic, but that forms a loop between Kafka and my component.

This is what bothers me: is this a desirable architecture in Kafka?

Should I instead do all the processing in Samza and store only the digested (summary) information in Kafka from Samza? The amount of processing I am going to do is quite heavy, which is why I want to use a separate component for it (ComponentA). I guess my question can be generalized to all kinds of data pipelines.

So is it a good practice for a component to be a consumer and a producer in a data pipeline?
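To make the pattern concrete, here is a minimal in-memory sketch of what ConsumerA does: consume several input topics, digest them into a summary, and produce that summary to a *separate* topic. All names (`topic-a`, `topic-b`, `summary-topic`) and the summary logic are illustrative assumptions, and a plain dict stands in for the Kafka broker:

```python
from collections import defaultdict

def summarize(records_by_topic):
    """Digest records from several input topics into one summary record."""
    return {
        topic: {"count": len(records), "bytes": sum(len(r) for r in records)}
        for topic, records in records_by_topic.items()
    }

# Stand-in for the Kafka broker: topic name -> list of messages.
broker = defaultdict(list)
broker["topic-a"].extend([b"req1", b"req2"])
broker["topic-b"].append(b"req3")

# ConsumerA consumes from its input topics...
inputs = {t: broker[t] for t in ("topic-a", "topic-b")}

# ...and produces the digest to a *different* topic, so nothing it
# emits is ever routed back into its own inputs.
broker["summary-topic"].append(summarize(inputs))
```

The important structural point is in the last line: the output topic is disjoint from the input topics, so even though the component is both a consumer and a producer, no cycle exists at the data level.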

ralzaul
  • "I further want to push this data to Kafka as a separate topic but that forms a loop on Kafka and my component." If you are publishing to a separate topic, how does it create a loop? Assuming that you are interested in only consuming all topics except the newly created topic. Also, is there any reason why both "ConsumerA" and Samza exist in your architecture, given that Samza will perform the same transformation that ConsumerA is doing? – Naveen Apr 23 '15 at 23:00
  • @Naveen They do not push the same topic to Kafka, but component-wise a loop is still formed between ComponentA and Kafka (ComponentA is both a producer and a consumer). – ralzaul Apr 24 '15 at 06:38
  • About Samza, no: I was asking whether I should do the processing in Samza rather than in my component. In that case the loop might be avoided, but since the amount of processing is high, I want to do it in a second component. – ralzaul Apr 24 '15 at 06:39
  • "component wise still a loop is formed" -- I don't get it, why is this a problem? Are you worried about network traffic? – Jon Bringhurst Apr 24 '15 at 17:10
  • That is what I am asking: can it create a problem? I can't see any case where this might cause harm, but it still feels like an architecture to be avoided. – ralzaul Apr 24 '15 at 17:18
  • It's probably too late, but the recently introduced Kafka Streams API caters to the same problem. – Hemant Patel Jun 09 '17 at 12:24

1 Answer


As long as Samza is writing to different topics than it is consuming from, no, there will be no problem. Samza jobs that read from and write to Kafka are the norm and intended by the architecture. One can also have Samza jobs that bring some data in from another system, or jobs that write some data from Kafka out to a different system (or even jobs that don't use Kafka at all).

Having a job read from and write to the same topic is, however, where you'd get a loop, and that should be avoided. It has the potential to fill up your Kafka brokers' disks very fast.
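The "different topics" condition above is easy to enforce with a sanity check on the job's configuration before it starts. A minimal sketch (the function name and topic names are assumptions, not part of any Kafka or Samza API):

```python
def check_no_loop(input_topics, output_topics):
    """Refuse configurations where a job would consume its own output,
    which is the loop that can fill the brokers' disks."""
    overlap = set(input_topics) & set(output_topics)
    if overlap:
        raise ValueError(
            f"job reads and writes the same topic(s): {sorted(overlap)}"
        )

# Safe: the summary goes to a topic the job never consumes.
check_no_loop(["topic-a", "topic-b"], ["summary-topic"])

# Unsafe: writing back into a consumed topic creates the loop.
try:
    check_no_loop(["topic-a"], ["topic-a"])
except ValueError as e:
    print(e)
```

Running such a check at deploy time turns the architectural rule into a hard guarantee rather than a convention each job author has to remember.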

Jakob Homan