Our Blog

Practical Apache Spark in 10 minutes. Part 4 - SQL

Spark SQL is the component of the Apache Spark big data framework designed for processing structured and semi-structured data. It provides a DataFrame API that simplifies and accelerates data manipulation. The DataFrame abstraction is well suited to sampling, filtering, aggregating, and visualizing data. In this article, we show how to construct and work with DataFrames using Spark SQL and PySpark.
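
As a rough sketch of the filtering and aggregating mentioned above, the function below builds a DataFrame from local rows and queries it with Spark SQL. It assumes an already-created `SparkSession` is passed in as `spark`; the function name `average_age_by_city`, the column names, and the `people` view name are hypothetical and chosen only for illustration.

```python
def average_age_by_city(spark, rows):
    """Return (city, average adult age) pairs, sorted by city.

    `spark` is assumed to be an existing pyspark.sql.SparkSession and
    `rows` a list of (name, city, age) tuples. This helper is hypothetical.
    """
    # Build a DataFrame with named columns from plain Python tuples.
    df = spark.createDataFrame(rows, ["name", "city", "age"])
    # Register the DataFrame as a temporary view so SQL can refer to it.
    df.createOrReplaceTempView("people")
    # Filter (WHERE) and aggregate (GROUP BY / AVG) in a single query.
    result = spark.sql(
        "SELECT city, AVG(age) AS avg_age "
        "FROM people WHERE age >= 18 "
        "GROUP BY city ORDER BY city"
    )
    # collect() brings the small result set back to the driver as Row objects.
    return [(row["city"], row["avg_age"]) for row in result.collect()]
```

With PySpark installed, a local session can be obtained with `SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()` and passed in as `spark`.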

Practical Apache Spark in 10 minutes. Part 3 - Data Frames

This is the next article in the Practical Apache Spark series, and it covers DataFrames. A DataFrame in Spark is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. Over the course of the tutorial we discuss the main sources from which DataFrames can be constructed: structured data files, tables in Hive, external databases, and existing RDDs.

Practical Apache Spark in 10 minutes. Part 2 - RDD

The second article in the series guides you through Apache Spark's primary abstraction: the Resilient Distributed Dataset (RDD), a fault-tolerant, distributed collection of elements that can be operated on in parallel. RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. The tutorial explains how to create and work with RDDs and includes a small practical task that shows how to work with data in Spark.
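
To make the idea of deriving one RDD from another more concrete, here is a minimal sketch. It assumes an existing `SparkContext` is passed in as `sc`, and for brevity it distributes a local collection with `parallelize` rather than reading from HDFS. The function name `total_even_squares` is hypothetical and does not come from the tutorial.

```python
from operator import add


def total_even_squares(sc, numbers):
    """Sum the squares of the even values in `numbers` using RDD operations.

    `sc` is assumed to be an existing pyspark.SparkContext; this helper is
    hypothetical and only illustrates the API.
    """
    # Create an RDD by distributing a local Python collection.
    rdd = sc.parallelize(numbers)
    # Transformations are lazy: each one returns a new RDD.
    evens = rdd.filter(lambda x: x % 2 == 0)
    squares = evens.map(lambda x: x * x)
    # An action triggers the computation; fold also handles an empty RDD.
    return squares.fold(0, add)
```

With PySpark installed, a local context can be created with `SparkContext("local[*]", "rdd-demo")`; calling `total_even_squares(sc, range(1, 6))` would then add 4 and 16 and return 20.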

Practical Apache Spark in 10 minutes. Part 1 - Ubuntu installation

Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. With this article, we begin a series of blog posts that walk through practical uses of Apache Spark. The first tutorial guides you step by step through installing Spark on an Ubuntu machine and includes all the information and files you need to start working with Spark.

Installation and running Ubuntu Virtual Box

Oracle VM VirtualBox is a suite of applications, system services, and drivers that emulates computer hardware inside the operating system where VirtualBox is installed. Almost any operating system can be installed on a virtual machine. This quick guide will help you install a working Ubuntu virtual machine in VirtualBox on your computer, which can be a great help in your work and studies.
