We live in a fast-paced world where data is collected everywhere, all the time. Raw data is useless, however, without the thorough understanding that comes from in-depth analysis and interpretation. For this, the data needs to be highly available and integrated into different systems and applications. The market is full of technologies and systems for this purpose, and most of them are moving towards the cloud. The advantages of this infrastructure option are well known: high availability, lower scaling costs and faster upgrades.
In this article we look at Amazon Web Services and its cloud technologies, with a focus on DynamoDB, API Gateway and Lambda functions. Together, these three services can be used to build an API that gives access to data and serves statistics quickly and at scale.
To begin with, DynamoDB is a non-relational, fully managed and scalable database that provides low-latency data access and an out-of-the-box HTTP API for querying and managing the data. Though it has some limitations when it comes to high volumes, DynamoDB is very useful when we want to provide fast access and high availability for the data.
With API Gateway, DynamoDB's API can be exposed in a simplified form to other applications and users who want to obtain data in a fast, secure and managed way. We can wrap DynamoDB's query endpoints in more intuitive ones and hide the technical details, so that end users can call those endpoints without detailed knowledge of DynamoDB's technology and concepts. We can also control access to the database simply by managing the authorisation roles used by the API Gateway integrations.
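To make the wrapping idea concrete, here is a minimal sketch of a Lambda function sitting behind an API Gateway proxy integration. It is an illustration under stated assumptions, not the implementation used in the project: the table name `SalesStatistics`, the key attribute `CustomerId`, the path parameter `customerId` and the optional `client` argument are all hypothetical.

```python
import json

TABLE_NAME = "SalesStatistics"  # hypothetical table name


def build_query(customer_id: str) -> dict:
    """Translate a friendly path parameter into DynamoDB Query parameters."""
    return {
        "TableName": TABLE_NAME,
        "KeyConditionExpression": "CustomerId = :cid",  # hypothetical partition key
        "ExpressionAttributeValues": {":cid": {"S": customer_id}},
    }


def handler(event: dict, context: object, client=None) -> dict:
    """Lambda entry point for an API Gateway proxy integration."""
    if client is None:
        import boto3  # deferred so the sketch can be exercised without AWS access

        client = boto3.client("dynamodb")

    customer_id = (event.get("pathParameters") or {}).get("customerId")
    if not customer_id:
        return {"statusCode": 400, "body": json.dumps({"error": "customerId is required"})}

    response = client.query(**build_query(customer_id))

    # Flatten scalar typed attributes ({"S": "x"}) into plain values.
    # Numbers stay as strings, and nested lists or maps are not unpacked here.
    items = [
        {name: next(iter(typed.values())) for name, typed in item.items()}
        for item in response.get("Items", [])
    ]
    return {"statusCode": 200, "body": json.dumps(items)}
```

The caller only sees a path such as `/customers/{customerId}` and a plain JSON response; the key expression and the typed attribute values stay inside the function.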
For example, we needed to expose data through a REST endpoint, and that data was available only in a data warehouse. Developing an entire REST layer over the database would not have been the most viable solution. Since we only needed a selected subset of the data, we chose to store it in DynamoDB, because it comes with an out-of-the-box API and can easily be wrapped by Lambda functions and API Gateway. We put in place a process that synchronises the data between the two data sources on a daily basis, using an ETL developed with SSIS and the integration between the aforementioned AWS services. This way, we avoided having to deploy the REST API to a server, and we gained easier maintenance and more flexibility for future implementations and changes to the table.
There were challenges, such as the DynamoDB BatchWrite endpoint's limit of 25 actions per batch. In the end, with a better provisioned write capacity configuration, we managed to export the data to DynamoDB without having to implement complex logic in the SSIS custom destination.
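The 25-action limit mostly comes down to splitting the export into chunks before sending it. The sketch below builds one `BatchWriteItem` request body per chunk. It assumes the items are already in DynamoDB's typed JSON form, and the helper names `chunked` and `build_batch_requests` are made up for illustration.

```python
from typing import Dict, Iterator, List

MAX_BATCH_SIZE = 25  # BatchWriteItem accepts at most 25 put/delete requests per call


def chunked(items: List[dict], size: int = MAX_BATCH_SIZE) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_requests(table_name: str, items: List[dict]) -> List[Dict]:
    """Build one BatchWriteItem request body per chunk of items."""
    return [
        {"RequestItems": {table_name: [{"PutRequest": {"Item": item}} for item in chunk]}}
        for chunk in chunked(items)
    ]
```

Each returned dictionary could then be sent as a single batch write call, so 60 records become three calls of 25, 25 and 10 items.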
- To send data to DynamoDB from SSIS, a third-party adapter or a custom Script Task destination is required, because the data needs to be in the DynamoDB JSON format if API Gateway is integrated directly with DynamoDB. If the integration goes through a Lambda function instead, plain JSON can be used. More information about the DynamoDB JSON format can be found in the official documentation.
- To increase write performance in DynamoDB, use the BatchWrite endpoint: it can make writes up to 3 times faster than the PutItem endpoint (measured on a volume of 200,000 records).
- Auto scaling only starts once the configured WCU (Write Capacity Units) threshold has been exceeded continuously for up to 5 minutes. When writing large volumes of data with insufficient WCU, the process needs either a retry mechanism (take care when mapping the DynamoDB response to the API Gateway endpoint response) or a step, run before the write begins, that sets the WCU to a value that lets the process run without bottlenecks. A suitable value can be read from the metrics as the average WCU consumed during a run of the process. Also, provisioned capacity can only be lowered a limited number of times per day (4 at the time of writing).
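The first and third points above can be sketched together: converting plain records into DynamoDB JSON, and retrying whatever a throttled batch leaves unwritten. This is a minimal illustration, not production code. `send_batch` is a hypothetical callable that writes a batch and returns the items it could not process, and the converter covers only the basic types (boto3 ships a `TypeSerializer` in `boto3.dynamodb.types` for the general case).

```python
import time
from decimal import Decimal
from typing import Any, Callable, Dict, List


def to_dynamodb_json(value: Any) -> Dict[str, Any]:
    """Wrap a plain Python value in DynamoDB's typed JSON representation."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):  # checked before numbers: bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, (int, float, Decimal)):
        return {"N": str(value)}  # numbers travel as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, list):
        return {"L": [to_dynamodb_json(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_dynamodb_json(v) for k, v in value.items()}}
    raise TypeError(f"Unsupported type: {type(value).__name__}")


def marshal_item(record: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a plain record into an item in DynamoDB JSON."""
    return {name: to_dynamodb_json(value) for name, value in record.items()}


def write_with_retry(
    send_batch: Callable[[List[Any]], List[Any]],  # hypothetical: returns unprocessed items
    batch: List[Any],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], Any] = time.sleep,
) -> List[Any]:
    """Resend unprocessed items with exponential backoff; return what is still pending."""
    pending = batch
    for attempt in range(max_attempts):
        pending = send_batch(pending)
        if not pending:
            break
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ... with the defaults
    return pending
```

Backing off between attempts gives auto scaling time to react. Raising the WCU before the run, as described above, avoids most retries in the first place.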
In conclusion, AWS provides interesting technologies for implementing serverless services with fast response times, high scalability, easy management and interconnectivity, all with the convenience of choosing among several widely used programming languages. The cloud has been the new infrastructure reality for some time and it is here to stay, with all of its advantages and with steadily fewer drawbacks.
by Alex Puiu, October 2018