tsar2512 February 2016

When using netcat to feed logs into Spark Streaming, why are the last few lines dropped?

I have a Spark Streaming service where I process data and detect anomalies based on an offline-generated model. I feed data into this service from a log file, which is streamed using the following command:

tail -f <logfile> | nc -lk 9999

Here the Spark Streaming service takes data from port 9999. However, I observe that the last few lines are dropped, i.e. Spark Streaming either does not receive those log lines or does not process them.

However, I also observed that if I simply redirect the log file to netcat's standard input instead of tailing it, no lines are dropped:

nc -q 10 -lk 9999 < logfile

Can anyone explain why this happens? And what would be a better way to stream log data to a Spark Streaming instance?

Answers


huitseeker February 2016

In Spark Streaming, data comes in over the wire and is assembled into a block on every block interval. Each block is replicated to other machines (according to your storage level) as soon as it is formed. Once a batch interval elapses, every block formed since the last batch-interval tick becomes part of a new RDD. Only once this RDD is formed can a job be scheduled on it, so the data collected during batch interval n is processed during batch interval n+1.
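For reference, a minimal sketch of the receiving side, assuming a socketTextStream receiver on port 9999 and an illustrative 10-second batch interval (the app name and the per-batch count are placeholders, not taken from the question):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object NetcatStream {
      def main(args: Array[String]): Unit = {
        // local[2]: one core for the receiver, one for processing
        val conf = new SparkConf().setAppName("NetcatStream").setMaster("local[2]")
        // Batch interval: blocks formed since the last tick become one RDD per batch
        val ssc = new StreamingContext(conf, Seconds(10))

        // Connect to the netcat listener from the question
        val lines = ssc.socketTextStream("localhost", 9999)
        lines.count().print()  // print the number of lines in each batch

        ssc.start()
        ssc.awaitTermination()
      }
    }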

So, the possible culprits for "losing a bit of data towards the end" could be:

  • you are observing your input file at the same time as you are monitoring the input for Spark. If you consider your monitoring at instant t, a bit after n batch intervals have elapsed, your log file has produced the data for n batches and then some ("a little bit more"). However, the beginning of the next batch (n+1) is at this stage still in the data-collection phase, in the form of blocks on your Receiver. No data has been lost; the processing of batch n+1 simply has not started yet.

  • or your application assumes it receives a similar number of elements in each RDD and does not correctly process the potentially (much) smaller RDD of the last batch (see the sketch after this list).

  • or you are stopping your application or data source before the last batch interval elapses (you need to wait n+1 batch intervals to see the processing of n batches of data). A graceful stop, also shown in the sketch below, avoids this.

  • or there is something weird occurring with the system clocks of your executors. Have you thought of synchronizing them with NTP?
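To make the second and third points concrete, here is a rough sketch, assuming the ssc and lines values from the earlier snippet, with the output operation registered before ssc.start() is called: it skips empty RDDs instead of assuming a fixed batch size, and stops the context gracefully so blocks already received are still processed.

    // Do not assume every RDD carries a similar number of elements:
    // the last batch may be (much) smaller, or empty.
    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        println(s"processing ${rdd.count()} lines")  // placeholder for real work
      }
    }

    // Stop gracefully on JVM shutdown: wait for received data to be
    // processed instead of dropping the blocks of the final batch.
    sys.addShutdownHook {
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }

Setting spark.streaming.stopGracefullyOnShutdown to true in the Spark configuration achieves the same graceful stop without the explicit hook.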
