Prakash

7 points

Authored Comments

28 Jul 2020

Define and optimize data partitions in Apache Cassandra

Hi.

A data analytics company has given me an assignment to design an architecture and explain it with diagrams. The assignment has two questions. Q1 is about choosing the right technology and data partitioning strategy using a NoSQL cloud database, and about reducing the compute time so that the entire compute load can finish in a few hours.

Now Q2. A trucking company deals with a lot of invoices (about 40,000 a day). Currently everyone can see all the invoices, including ones that are not related to them. How would you design a system to store all this data in a cost-efficient way? How would you design an authorization system to ensure organizations can see only the invoices related to themselves?

I saw your blog post on data partitioning in Cassandra. I think you can help me, as you may already know the solution.

Thanks
Prakash Saswadkar
Mumbai, mob: +91-981 941 5206

-- Copy pasted from word doc --
Problem1:

A large fast food chain wants you to generate forecasts for its 2000 restaurants. Each restaurant sells close to 500 items.
The fast food chain provides data for the last 3 years at the store, item, day level. The ask is to provide a forecast for the following year.
Assume the data is static. Data scientists have looked at the problem and figured out a solution that provides the best forecast.
The data scientists have built an algorithm that takes all data at the store level and produces forecasted output at the store level.
It takes them 15 minutes to process each store.

Questions:
1) Given that the input data is static, what is the right technology to store the data, and what would be the partitioning strategy?
2) Each store takes 15 minutes. How would you design the system to orchestrate the compute faster, so that the entire compute can finish in < 5 hrs?

Make any assumptions you need and state them as you design the solution. Do not worry about the analytic part; assume it
is a black box.
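
To make the numbers in Problem 1 concrete, here is a rough sizing sketch. It only does the arithmetic implied by the problem statement (2000 stores, 500 items, 3 years of daily data, 15 minutes per store, a target under 5 hours). The function name `workers_needed` is hypothetical, and it assumes each worker processes one store at a time with zero scheduling or data-loading overhead, which a real system would not achieve.

```python
import math

STORES = 2000
ITEMS_PER_STORE = 500
DAYS_OF_HISTORY = 3 * 365      # 3 years of daily data, leap days ignored
MINUTES_PER_STORE = 15
DEADLINE_MINUTES = 5 * 60

def workers_needed(stores, minutes_per_store, deadline_minutes, strict=True):
    """Minimum parallel workers, assuming one store at a time per worker
    and no overhead (hypothetical helper, not part of the assignment)."""
    # Whole stores one worker can process back to back within the deadline.
    per_worker = deadline_minutes // minutes_per_store
    # For a strict "< deadline" target, a worker that exactly fills the
    # window finishes at the deadline, so it must take one store fewer.
    if strict and per_worker * minutes_per_store == deadline_minutes:
        per_worker -= 1
    if per_worker < 1:
        raise ValueError("a single store does not fit inside the deadline")
    return math.ceil(stores / per_worker)

# Data volume if one row is one (store, item, day) observation.
rows_per_store = ITEMS_PER_STORE * DAYS_OF_HISTORY   # 547,500
total_rows = STORES * rows_per_store                 # 1,095,000,000

# Sequential runtime versus the parallelism the target implies.
sequential_hours = STORES * MINUTES_PER_STORE / 60   # 500 hours
at_deadline = workers_needed(STORES, MINUTES_PER_STORE, DEADLINE_MINUTES, strict=False)
under_deadline = workers_needed(STORES, MINUTES_PER_STORE, DEADLINE_MINUTES, strict=True)

print(f"rows per store: {rows_per_store:,}, total rows: {total_rows:,}")
print(f"sequential: {sequential_hours:.0f} h")
print(f"workers to finish in exactly 5 h: {at_deadline}")
print(f"workers to finish strictly under 5 h: {under_deadline}")
```

Since the algorithm consumes all data for one store at a time, the store is a natural unit both for partitioning the data and for distributing the work, so each worker would read exactly one partition per job. That is one possible reading of the problem, not the only valid design.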

Problem 2:

A trucking company deals with a lot of invoices, close to 40,000 a day. Regulatory requirements need 7 years of data to be stored.
A trucker scans the invoice on his mobile device at the point of delivery. An image recognition program scans the invoice and adds
meta information captured from the image. Meta information will include shipped from, shipped to, and other information.
The trucking company can see all its invoices. The shipped-from organizations can view all invoices whose shipped from matches theirs,
and similar rules apply to shipped to.
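
The visibility rules above can be sketched as a small predicate. This is only an illustration of the rule as stated. The `Invoice` fields (`carrier_org`, `shipped_from_org`, `shipped_to_org`) and the function names are hypothetical, and a real system would enforce the same check in the query layer rather than by filtering in memory.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Invoice:
    # All field names are assumptions made for this sketch.
    invoice_id: str
    carrier_org: str        # the trucking company that owns the invoice
    shipped_from_org: str   # taken from the scanned meta information
    shipped_to_org: str     # taken from the scanned meta information

def can_view(org_id: str, invoice: Invoice) -> bool:
    """An organization may view an invoice if it is the trucking company,
    the shipped-from party, or the shipped-to party."""
    return org_id in (
        invoice.carrier_org,
        invoice.shipped_from_org,
        invoice.shipped_to_org,
    )

def visible_invoices(org_id: str, invoices: list[Invoice]) -> list[Invoice]:
    """Filter a batch of invoices down to those the organization may see."""
    return [inv for inv in invoices if can_view(org_id, inv)]

invoices = [
    Invoice("inv-1", "trucking-co", "acme", "globex"),
    Invoice("inv-2", "trucking-co", "initech", "acme"),
    Invoice("inv-3", "trucking-co", "initech", "globex"),
]

print([inv.invoice_id for inv in visible_invoices("acme", invoices)])
print([inv.invoice_id for inv in visible_invoices("trucking-co", invoices)])
```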

Questions:
How would you design a system to store all this data in a cost-efficient way?
How would you design an authorization system to ensure organizations can only see invoices based on the rules stated above?
What would be the design considerations to make the solution globally available?
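
For the storage question, a rough volume estimate helps frame the cost discussion. The invoice count follows directly from the figures in the problem (40,000 a day, 7 years). The average scanned image size is NOT given in the problem; the 500 KB used below is purely an assumption for illustration and should be replaced with a measured value.

```python
INVOICES_PER_DAY = 40_000
RETENTION_YEARS = 7
DAYS_PER_YEAR = 365            # leap days ignored for a rough estimate
ASSUMED_IMAGE_KB = 500         # assumption: not stated in the problem

def retained_invoices(per_day=INVOICES_PER_DAY, years=RETENTION_YEARS):
    """Number of invoices held at the end of the retention window."""
    return per_day * DAYS_PER_YEAR * years

def retained_terabytes(invoice_count, image_kb=ASSUMED_IMAGE_KB):
    """Raw image volume in decimal terabytes (1 TB = 10**12 bytes)."""
    return invoice_count * image_kb * 1000 / 1e12

count = retained_invoices()
print(f"invoices retained over {RETENTION_YEARS} years: {count:,}")
print(f"raw image volume at {ASSUMED_IMAGE_KB} KB each: {retained_terabytes(count):.1f} TB")
```

The meta information per invoice is small compared with the image, so under this assumption the image store dominates the cost and is the part worth tiering by age.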
-- --