Why Apache Spark and what is parallel processing?

Recently at Rakuten we successfully migrated our mainframe from COBOL to Java (from an old Fujitsu machine to Oracle Exalogic) - press release. It was almost a 3-year-long project, and towards the end the entire company was working on it. I transitioned into the mainframe development team to rewrite several batches from COBOL to Java.

Now of course I cannot read COBOL (although it's not that different). We had machine translation software convert the COBOL to Java but honestly, that was hardly any help. The converted code was not smart: it used zero OOP concepts, it used customized stacks to remember function calls (haha, imagine writing your own implementation to remember function calls) and it looked like COBOL that had just changed its clothes. As a result, these batches performed very slowly - their run times were significantly higher than those of their COBOL counterparts for the same data.

So the job was simple: make these batches faster - in fact, much faster than their COBOL counterparts, otherwise what's the point?

My team chose Apache Spark as the framework to process data in parallel. In this post, I try to explore why they made this decision and how Spark compares to traditional batch processing.

Background of our batches

For most of the batch processing we did, there was a fixed pattern to be noticed (a minimal sketch in code follows the list):

  1. Read data from a database and/or a file
  2. Process that data, for each row of the database table / each record of the file
  3. Write the results to a database and/or a file
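For the file case, a minimal sketch of that pattern in Java could look like the following - the file names and the toUpperCase step are stand-ins for whatever a real batch does, and the database variant is omitted:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class ReadProcessWrite {
    public static void main(String[] args) throws IOException {
        // 1. Read: load every record from the input file
        List<String> records = Files.readAllLines(Paths.get("input.txt"));

        // 2. Process: apply some per-record logic (toUpperCase is a stand-in)
        List<String> processed = records.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());

        // 3. Write: dump the results into the output file
        Files.write(Paths.get("output.txt"), processed);
    }
}
```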

Assume we have an input file of 100 records which needs to be processed and written into an output file.

Traditional Processing
If we were to use traditional processing, we would read the first record, process it, and write it to the output. Then we would read the next record, process it, and write it out, repeating until all 100 records are done. This is a single sequential flow - and, as we can guess, it would take a long time to complete the whole operation. Here's how it can be represented:
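In code, that single flow might look like this sketch - one loop that handles a single record per iteration (again, input.txt/output.txt and the per-record logic are hypothetical):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class TraditionalProcessing {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"));
             BufferedWriter writer = new BufferedWriter(new FileWriter("output.txt"))) {
            String record;
            // One sequential flow: read a record, process it, write it, repeat
            while ((record = reader.readLine()) != null) {
                writer.write(process(record));
                writer.newLine();
            }
        }
    }

    // Stand-in for whatever per-record business logic the batch performs
    static String process(String record) {
        return record.toUpperCase();
    }
}
```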

Batch Processing

However, we already use batch processing: we divide our input records into groups - let's say groups of 20, making 5 groups in total.
Here we read the first group (i.e. 20 records) together, process them, and write them to the output. Then we read the next group, process it, and write it out, repeating for all 5 groups. This is much faster than traditional processing, since the overhead of switching between tasks and touching the files is paid once per group instead of once per record. Something like this:
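The same sketch, reworked to handle one group of 20 at a time (group size and file names are, again, just illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class BatchProcessing {
    static final int GROUP_SIZE = 20;

    public static void main(String[] args) throws IOException {
        List<String> records = Files.readAllLines(Paths.get("input.txt")); // 100 records
        List<String> output = new ArrayList<>();

        // Handle the records one group of 20 at a time: 5 groups in total
        for (int start = 0; start < records.size(); start += GROUP_SIZE) {
            List<String> group =
                    records.subList(start, Math.min(start + GROUP_SIZE, records.size()));
            for (String record : group) {
                output.add(record.toUpperCase()); // stand-in processing
            }
        }
        Files.write(Paths.get("output.txt"), output);
    }
}
```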

Parallel Processing

However, what if we used that same batch processing model, but with a twist? Let's say you had 5 different machines; you could give each group to one machine, and the total time taken would be roughly one fifth that of the batch processing model. This approach is called parallel processing. Roughly, instead of one worker, you have multiple workers, each working on its own batch:

This is also the approach that Spark takes to process our batches. The way Spark works differs slightly from the diagram above, but we will come to the details in the next post.
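Setting Spark aside for a moment, the idea itself can be sketched with plain Java threads - a toy illustration of 5 workers each taking a group of 20, not how Spark actually schedules work (file names and the toUpperCase step remain stand-ins):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class ParallelProcessing {
    static final int WORKERS = 5;
    static final int GROUP_SIZE = 20;

    public static void main(String[] args) throws Exception {
        List<String> records = Files.readAllLines(Paths.get("input.txt")); // 100 records
        ExecutorService pool = Executors.newFixedThreadPool(WORKERS);

        // Hand each group of 20 to its own worker; all 5 groups run at once
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int start = 0; start < records.size(); start += GROUP_SIZE) {
            List<String> group =
                    records.subList(start, Math.min(start + GROUP_SIZE, records.size()));
            futures.add(pool.submit(() ->
                    group.stream().map(String::toUpperCase).collect(Collectors.toList())));
        }

        // Collect the results in group order and write them out
        List<String> output = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            output.addAll(f.get());
        }
        pool.shutdown();
        Files.write(Paths.get("output.txt"), output);
    }
}
```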

So, why Spark?

In conclusion, we can say that batch processing is indeed much faster than traditional record-by-record processing. Parallel processing is faster still, because you operate on multiple groups at the same time instead of just one group at a time. Of course, this means we need more resources to achieve it (e.g. the number of workers required increases in the diagrams above).
The batches that ran on our mainframe had very little logic to them. They were about feeding data from one process to another based on a bunch of conditions - a very, very simplified description of how it really looks, but in summary, there was more work in reading the input and writing the output than in processing the data in between.

Spark fit this case because even if we had just cleaned up the bloated processing in the converted code, we would have made practically no progress on the I/O, which is where the time actually went; the logic processing would have pretty much remained the same either way. Apache Spark blessed us with the power of forgetting about the logic and concentrating only on the I/O to speed up the processing times. It came with its own challenges, like writing custom file readers and writers, but that's the story of another post!
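For a flavour of what such an I/O-heavy pipeline looks like when handed to Spark, here is a minimal sketch using Spark's Java Dataset API - the paths and the trivial transform are hypothetical, and this is not our actual production code:

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class SparkBatchSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-batch-sketch")
                .getOrCreate();

        // Spark splits the input into partitions and distributes them to workers;
        // the per-record logic stays a simple map, while reading, writing and
        // parallelism are Spark's job
        Dataset<String> records = spark.read().textFile("input.txt");
        Dataset<String> processed = records.map(
                (MapFunction<String, String>) String::toUpperCase,
                Encoders.STRING());

        processed.write().text("output-dir");
        spark.stop();
    }
}
```

The appeal is exactly what the paragraph above describes: the business logic is one `map` call, and the framework takes care of partitioning the input and parallelizing the reads and writes.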


Everyday Motivation

Nights like these which never cease
When the warm summer air refuses to hug
Tickles of sweat shining on forearms
The noise of the silence across the neighborhood
An array of mess waiting to be cleared
Bullet things to do piling on top of other
No drop of saliva wetting a mum mouth
Dishes lying unwashed, laundry won't do itself
A half eaten dinner won't fill up an empty heart

Misses the mere presence of you

Weather small talks

From
Hibernating under the heater
Dragging through the dusky days
Wriggling out of winter's wrath
Pulling back from depressing emotions
Crawling out of the sleepy snow
And sobbing into the sleepless froids
To
Stepping into sunny springs
Awaiting the pink blossoms
Drinking beer to live a laugh
Finding reasons to go to the beach
Shedding off the excess fur
Dressing up in hats and gowns
Cheers to longer days and warmer spirits!

Immunity

As young kids to learn about roaches
how they've survived
thousands of years
Much before the humans evaded
those days when the dinosaurs thrived
they lived,
in their tiny little bodies with hard rock armors
The ultimate survivors of long lasting famines
when food was scarce, they ate their own
cannibals such
Those pesticides and medicines
they swallowed them all
nothing can destroy them
remnants of repellents they are
In need of no homes,
no particular demands
ugly looking creatures, as strong they are
In awe we remained
of the yet scary resistant mortals
mysterious experienced and immune.

As adults to realize
the cockroaches aren't scary no more
they've emerged far more victorious
than science textbooks care to mention
In men lie their horcrux
in their ability to advance
despite worlds falling apart
Shields of ignorance strengthened by recurrence
attention spans too short to notice embellishments
desires so selfish and hearts merciless
the added audacity to take advantage of the weaker rest
Abusing the abstract, defining their own
men of powerful stature lurking in homes
Not found in corners of pipes but
relaxing in their immune chairs.