The SMACK stack is being used for building modern enterprise apps because it performs each of these objectives with a loosely coupled toolchain of technologies that are are all open source, and production-proven at scale.
Spark – A general engine for large-scale data processing, enabling analytics from SQL queries to machine learning, graph analytics, and stream processing
Mesos – Distributed systems kernel that provides resourcing and isolation across all the other SMACK stack components. Mesos is the foundation on which other SMACK stack components run.
Akka – A toolkit and runtime to easily create concurrent and distributed apps that are responsive to messages.
Cassandra – Distributed database management system that can handle large amounts of data across servers with high availability.
Kafka – A high throughput, low-latency platform for handling real-time data feeds with no data loss.
WHAT APACHE SPARK DOES
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop.
The Hadoop YARN-based architecture provides the foundation that enables Spark and other applications to share a common cluster and dataset while ensuring consistent levels of service and response. Spark is now one of many data access engines that work with YARN in HDP.
Developing a cluster-agnostic application will be at the forefront of the minds of developers. But with this restriction, it is common that the advantages of using a specific cluster technology will be lost. Developing for a Mesos framework (a framework, from here on) exposes a host of extra features that are not available, and sometimes not possible, when writing your own framework to overlay Mesos.
To summarise, a framework provides an application with an API to monitor the state of the tasks that the application is responsible for. Based upon the state, the application can perform some action. The management is performed by an implementation of the Scheduler interface, and the tasks are performed by implementations of the Executor interface. Mesos will communicate with the Scheduler via the API to inform it of the current state of the cluster and the state of the tasks that have been started. Using this API, it is possible to encode a level of scheduling that would not be possible using a standard orchestration tool like Marathon. (http://mesos.apache.org/documentation/latest/app-framework-development-guide/)
The key use case for a framework is to provide enhanced resilience against failure. For example, take HDFS, which requires a NameNode, DataNode and JournalNode. These need to be started in a certain order and the cluster topology is complex. For HA, two NameNodes and at least three JournalNodes on separate machines are required. Another example could be for a database like Elasticsearch, Riak, Neo4j, Cassandra, etc. For best resiliency, it would be optimal to place only a single instance of a data nodes on each physical machine.
Frameworks also provide other features like authorisation and roles, to allow applications to only use certain resources and reserve resources, and the ability to install applications on Mesos in a single request.
Another reason is to enable application-specific scaling. For example, for a database, scaling the data nodes, not controller nodes, may be important. Frameworks also enable the prospect of auto-scaling, depending on a series of sensors that are only known to Mesos (e.g. RAM/Disk usage).
Finally cluster specific business logic, like when and how backups are performed, or upgrading an installation can only occur seamlessly using the information provided by Mesos. For example, it may not be wise to perform a snapshot backup when tasks are under heavy load, it might be better to scale horizontally first, to add capacity, then backup, then scale back. This complexity would be difficult, if not impossible to capture in a traditional microservices environment.
Introduction to Akka
Welcome to Akka, a set of open-source libraries for designing scalable, resilient systems that span processor cores and networks. Akka allows you to focus on meeting business needs instead of writing low-level code to provide reliable behavior, fault tolerance, and high performance.
Many common practices and accepted programming models do not address important challenges inherent in designing systems for modern computer architectures. To be successful, distributed systems must cope in an environment where components crash without responding, messages get lost without a trace on the wire, and network latency fluctuates. These problems occur regularly in carefully managed intra-datacenter environments - even more so in virtualized architectures.
To help you deal with these realities, Akka provides:
Multi-threaded behavior without the use of low-level concurrency constructs like atomics or locks — relieving you from even thinking about memory visibility issues.
Transparent remote communication between systems and their components — relieving you from writing and maintaining difficult networking code.
A clustered, high-availability architecture that is elastic, scales in or out, on demand — enabling you to deliver a truly reactive system.
Akka’s use of the actor model provides a level of abstraction that makes it easier to write correct concurrent, parallel and distributed systems. The actor model spans the full set of Akka libraries, providing you with a consistent way of understanding and using them. Thus, Akka offers a depth of integration that you cannot achieve by picking libraries to solve individual problems and trying to piece them together.
By learning Akka and how to use the actor model, you will gain access to a vast and deep set of tools that solve difficult distributed/parallel systems problems in a uniform programming model where everything fits together tightly and efficiently.
What is Apache Cassandra?
Apache Cassandra is a massively scalable open source non-relational database that offers continuous availability, linear scale performance, operational simplicity and easy data distribution across multiple data centers and cloud availability zones. Cassandra was originally developed at Facebook, was open sourced in 2008, and became a top-level Apache project in 2010.
Key Cassandra Features and Benefits
Cassandra provides a number of key features and benefits for those looking to use it as the underlying database for modern online applications:
Massively scalable architecture – a masterless design where all nodes are the same, which provides operational simplicity and easy scale-out.
Active everywhere design – all nodes may be written to and read from.
Linear scale performance – the ability to add nodes without going down produces predictable increases in performance.
Continuous availability – offers redundancy of both data and node function, which eliminate single points of failure and provide constant uptime.
Transparent fault detection and recovery – nodes that fail can easily be restored or replaced.
Flexible and dynamic data model – supports modern data types with fast writes and reads.
Strong data protection – a commit log design ensures no data loss and built in security with backup/restore keeps data protected and safe.
Tunable data consistency – support for strong or eventual data consistency across a widely distributed cluster.
Multi-data center replication – cross data center (in multiple geographies) and multi-cloud availability zone support for writes/reads.
Data compression – data compressed up to 80% without performance overhead.
CQL (Cassandra Query Language) – an SQL-like language that makes moving from a relational database very easy.
What is Kafka?
Kafka is one of those systems that is very simple to describe at a high level, but has an incredible depth of technical detail when you dig deeper. The Kafka documentation does an excellent job of explaining the many design and implementation subtleties in the system, so we will not attempt to explain them all here. In summary, Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable.
Like many publish-subscribe messaging systems, Kafka maintains feeds of messages in topics. Producers write data to topics and consumers read from topics. Since Kafka is a distributed system, topics are partitioned and replicated across multiple nodes.
Messages are simply byte arrays and the developers can use them to store any object in any format – with String, JSON, and Avro the most common. It is possible to attach a key to each message, in which case the producer guarantees that all messages with the same key will arrive to the same partition. When consuming from a topic, it is possible to configure a consumer group with multiple consumers. Each consumer in a consumer group will read messages from a unique subset of partitions in each topic they subscribe to, so each message is delivered to one consumer in the group, and all messages with the same key arrive at the same consumer.
What makes Kafka unique is that Kafka treats each topic partition as a log (an ordered set of messages). Each message in a partition is assigned a unique offset. Kafka does not attempt to track which messages were read by each consumer and only retain unread messages; rather, Kafka retains all messages for a set amount of time, and consumers are responsible to track their location in each log. Consequently, Kafka can support a large number of consumers and retain large amounts of data with very little overhead.
Next, let’s look at how Kafka’s unique properties are applied in a specific use case.
Kafka at Work
Suppose we are developing a massive multiplayer online game. In these games, players cooperate and compete with each other in a virtual world. Often players trade with each other, exchanging game items and money, so as game developers it is important to make sure players don’t cheat: Trades will be flagged if the trade amount is significantly larger than normal for the player and if the IP the player is logged in with is different than the IP used for the last 20 games. In addition to flagging trades in real-time, we also want to load the data to Apache Hadoop, where our data scientists can use it to train and test new algorithms.
For the real-time event flagging, it will be best if we can reach the decision quickly based on data that is cached on the game server memory, at least for our most active players. Our system has multiple game servers and the data set that includes the last 20 logins and last 20 trades for each player can fit in the memory we have, if we partition it between our game servers.
Our game servers have to perform two distinct roles: The first is to accept and propagate user actions and the second to process trade information in real time and flag suspicious events. To perform the second role effectively, we want the whole history of trade events for each user to reside in memory of a single server. This means we have to pass messages between the servers, since the server that accepts the user action may not have his trade history. To keep the roles loosely coupled, we use Kafka to pass messages between the servers, as you’ll see below.
Kafka has several features that make it a good fit for our requirements: scalability, data partitioning, low latency, and the ability to handle large number of diverse consumers. We have configured Kafka with a single topic for logins and trades. The reason we need a single topic is to make sure that trades arrive to our system after we already have information about the login (so we can make sure the gamer logged in from his usual IP). Kafka maintains order within a topic, but not between topics.
When a user logs in or makes a trade, the accepting server immediately sends the event into Kafka. We send messages with the user id as the key, and the event as the value. This guarantees that all trades and logins from the same user arrive to the same Kafka partition. Each event processing server runs a Kafka consumer, each of which is configured to be part of the same group—this way, each server reads data from few Kafka partitions, and all the data about a particular user arrives to the same event processing server (which can be different from the accepting server). When the event-processing server reads a user trade from Kafka, it adds the event to the user’s event history it caches in local memory. Then it can access the user’s event history from the local cache and flag suspicious events without additional network or disk overhead. Source: https://mesosphere.com/blog/smack-stack-new-lamp-stack/